Preface


The 20th Symposium on the Interface: Computing Science and Statistics was held on April
20th  through  23rd,  1988.  The  20th  Symposium  on  the  Interface  was  in  a  number  of  senses  a  watershed 
event  for  the  Interface  series.  Begun  in  1967  in  Southern  California  as  a  one-day  workshop  meeting 
under  the  guidance  of  Arnie  Goodman  and  Nancy  Mann,  it  had  matured  over  the  years  to  a  rather 
large  scale  event.  The  Board  of  Governors  of  the  18th  Interface  appointed  an  ad  hoc  committee  to 
investigate  incorporation  of  the  Interface  to  preserve  its  financial  and  intellectual  independence.  At  the 
19th  Interface  a  plan  was  presented  by  one  of  us  (EJW)  which  included  bylaws  and  a  plan  for 
incorporation.  This  plan  was  approved  and  in  August  of  1987,  the  Interface  Foundation  of  North 
America, Inc. was formed as a non-profit, educational corporation as the legal underpinning for the
Interface Symposium series. The 20th Symposium was the first sponsored under the Interface
Foundation  banner.  It  was  an  auspicious  start  with  a  50%  increase  in  attendance,  with  the  number  of 
contributed  papers  nearly  doubled,  and  with  a  healthy  support  from  the  federal  funding  agencies. 

At  the  18th  Interface,  much  of  the  discussion  in  executive  session  focused  on  the  direction  of 
the  meetings.  The  vision  for  the  Interface  Symposia  obviously  drew  its  focus  from  the  interplay  of 
computer science and statistics. While this was a largely unexplored area in 1967, the interface, in
fact,  has  matured  substantially  and  many  of  us  thought  that  the  interface  was  simply  too  broad  and 
unfocused  to  remain  the  general  theme  of  the  Symposium.  The  19th  Interface  Symposium,  already 
well underway at that stage, was developed around the theme of Large Scale Statistical Computing.
The 20th Interface Symposium was, in fact, developed from first blush with the theme of
Computationally Intensive Statistical Methods. Much of the factual detail about the 20th Symposium is
contained  in  the  front  pages  immediately  following  this  Preface,  e.g.  lists  of  people  involved,  the  past 
Interface  Symposia,  exhibitors,  cooperating  societies,  the  program  schedule  and  the  listing  of  papers  in 
the  technical  program.  We  hope  that  these  will  be  of  interest. 

We  have,  however,  broken  with  past  tradition  which  has  organized  the  Proceedings  around  the 
format  of  the  technical  program.  As  with  any  Symposium,  speakers  often  exhibit  variances  with  the 
formal  session  titles  in  which  they  are  scheduled  to  speak.  In  addition,  when  a  theme  is  announced,  it 
is  often  the  case  that  contributed  papers  closely  related  to  the  themes  of  invited  paper  sessions  are  also 
submitted. We felt that, the Symposium now being history, it would be better to organize the
Proceedings around the logical themes of the papers actually submitted for the Proceedings and to make
comparatively little distinction between invited papers and contributed papers. The clusters
of  papers  are  our  choice  and  others  may  quibble  with  the  classifications  we  made.  Nonetheless,  we 
hope  that  the  organization  of  this  volume  makes  logical  sense  to  the  reader  and,  more  importantly, 
that  the  reader  finds  it  to  be  useful. 

Our major remaining task is to thank those people and organizations responsible for the success
of  the  meeting.  A  major  contributor  to  the  success  was  our  secretary,  Jan  P.  Guenther.  Many  of  the 
organization  details  that  are  attributed  to  the  Program  Chairman  were  in  fact  her  ideas  and  we  wish  to 
publicly  acknowledge  our  debt  to  her.  A  number  of  our  graduate  students,  notably  Masood 
Bolorforoush, Hung T. Le, Celesta Ball and Dale Penner, spent long days in preparation and execution
of many of the details. We also would like to acknowledge the patience of our families, notably the
Wegman and the Guenther families, during the preparatory phases of the Symposium. The
cooperating societies and organizations should be acknowledged as well. They are listed later in the
program. Special note should be made of the Institute of Mathematical Statistics, the National
Computer Graphics Association, the American Mathematical Society and the Society for Industrial and
Applied Mathematics, each of which provided the Symposium organizers with free access to their
membership lists. The National Bureau of Standards, now the National Institute of Standards and
Technology, printed the original announcement and mailed both the first and second sets of
announcements.

The 20th Interface Symposium, as has already been mentioned, was the beneficiary of funding
from several government agencies including the Air Force Office of Scientific Research under grant
number AFOSR-88-0154, the Army Research Office under grant number DAAL03-88-G-0020, the
National  Science  Foundation  under  grant  number  DMS-8722898  and  the  Office  of  Naval  Research 
under  grant  number  N00014-88-J-1049.  The  editorial  work  of  EJW  on  this  volume  was  supported  by 
the  Air  Force  Office  of  Scientific  Research  under  grant  number  AFOSR-87-0179,  the  Army  Research 
Office  under  contract  number  DAAL03-87-K-0087,  the  National  Science  Foundation  under  grant 
number  DMS-8701931  and  the  Virginia  Center  for  Innovative  Technology  under  contract  number 
CIT/SPC-87-005. The latter contract also supported a portion of Jan Guenther’s work.

Edward  J.  Wegman 
Donald  T.  Gantz 
John  J.  Miller 
Fairfax,  Virginia 




TABLE  OF  CONTENTS 


iii Preface
xv Program Committee
xvi Past Interface Symposia
xvii Future Interface Symposia
xvii General Information
xxviii Cooperating Societies
xxix Program Schedule
xxxvii Availability of Proceedings


1 I. KEYNOTE ADDRESS

3 Computer Intensive Statistical Inference
Bradley Efron, Stanford University


11 II. SPECIAL INVITED PAPER

13 Fitting Functions to Noisy Data in High Dimensions
Jerome H. Friedman, Stanford University


45 III. COMPUTATIONALLY INTENSIVE STATISTICAL METHODS

47 Computational Aspects of Bayesian Methods
A.F.M. Smith, University of Nottingham, U.K.

49 A Bayesian Approach to the Design and Analysis of Computational Experiments
Toby J. Mitchell, Max D. Morris, Oak Ridge National Laboratory

52 Additive Principal Components: A Method for Estimating Additive Equations with
Small Variance
Deborah J. Donnell, Bellcore

62 Stochastic Tests of Fit
P. Warwick Millar, University of California, Berkeley

68 Bootstrap Inference for Replicated Experiments
Walter Liggett, National Bureau of Standards

74 Regression Strategies
David Brownstone, University of California, Irvine

80 Data Sensitivity Computation for Maximum Likelihood Estimation
Daniel C. Chin, Johns Hopkins University Applied Physics Laboratory

86 Bootstrap Procedures in Random Effect Models for Comparing Response Rates in
Multi-Center Clinical Trials
Michael F. Miller, Hoechst-Roussel Pharmaceuticals, Inc.

92 Bootstrapping the Mixed Regression Model with Reference to the Capital and
Energy Complementarity Debate
Baldev Raj, Wilfrid Laurier University

IV.  STATISTICAL  GRAPHICS 


Dimensionality  Constraints  on  Projection  and  Section  Views  of  High  Dimensional 
Loci 

George  W.  Furnas,  Bell  Communications  Research 

A  Demonstration  of  the  Data  Viewer 

Catherine Hurley, University of Waterloo

Visualizing  Multi-Dimensional  Geometry  with  Parallel  Coordinates 

Alfred  Inselberg,  IBM  Scientific  Center  and  University  of  Southern 
California; Bernard Dimsdale, IBM Scientific Center

On  Some  Graphical  Representations  of  Multivariate  Data 

Masood  Bolorforoush,  Edward  J.  Wegman,  George  Mason  University 

Graphical  Representations  of  Main  Effects  and  Interaction  Effects  in  a 
Polynomial Regression on Several Predictors

William  DuMouchel,  BBN  Software  Products  Corporation 


V.  COMPUTATIONAL  ASPECTS  OF  SIMULATED  ANNEALING 

Computational  Experience  with  Generalized  Simulated  Annealing 

Daniel  G.  Brooks,  William  A.  Verdini,  Arizona  State  University 

Simulated  Annealing  in  the  Construction  of  Exact  Optimal  Designs 

Ruth K. Meyer, St. Cloud State University; Christopher J. Nachtsheim,
University  of  Minnesota 

A  Simulated  Annealing  Approach  to  Mapping  DNA 

Larry  Goldstein,  Michael  S.  Waterman,  University  of  Southern  California 


VI.  PARALLEL  COMPUTING 

Modeling Parallelism: An Interdisciplinary Approach

Elizabeth  A.  Unger,  Sallie  Keller-McNulty,  Kansas  State  University 

Asynchronous  Iteration 

William  F.  Eddy,  Mark  J.  Schervish,  Carnegie  Mellon  University 

Continuous  Valued  Neural  Networks:  Approximation  Theoretic  Results 
George  Cybenko,  University  of  Illinois  at  Urbana-Champaign 

Parameter  Identification  for  Stochastic  Neural  Systems 
Muhammad  K.  Habib,  George  Mason  University 

Statistical  Learning  Networks:  A  Unifying  View 

Andrew  R.  Barron,  University  of  Illinois;  Roger  L.  Barron,  Barron 
Associates,  Inc. 

Markov  Chains  Arising  in  Collective  Computation  Networks  with  Additive  Noise 
Robert  H.  Baron,  Naval  Surface  Warfare  Center 


209  Parallel  Optimization  Via  the  Block  Lanczos  Method 

Stephen  G.  Nash,  Ariela  Safer,  George  Mason  University 

214  A  Tool  to  Generate  Fortran  Parallel  Code  for  the  Intel  IPSC/2  Hypercube 
Carlos  Gonzalez,  J.  Chen,  J.  Sarma,  George  Mason  University 

220  Multiply  Twisted  N-Cubes  for  Parallel  Computing 

T.-H.  Shiau,  Paul  Blackwell,  Kemal  Efe,  University  of  Missouri-Columbia 

224  All-Subsets  Regression  on  a  Hypercube  Multiprocessor 
Peter  Wollan,  Michigan  Technological  University 

228  Testing  Parallel  Random  Number  Generators 

Mark  J.  Durst,  Lawrence  Livermore  National  Laboratory 


233 VII. DENSITY AND FUNCTION ESTIMATION

235  Interactive  Smoothing  Techniques 

Wolfgang Härdle, University of Bonn

241  Interactive  Multivariate  Density  Estimation  in  the  S  Language 
David  W.  Scott,  Mark  R.  Hall,  Rice  University 

246  Smoothing  Data  with  Correlated  Errors 

Naomi  S.  Altman,  Cornell  University 

254  Derivative  Estimation  by  Polynomial-Trigonometric  Regression 

Randy  Eubank,  Southern  Methodist  University;  Paul  Speckman,  University 
of  Missouri 

260  Efficient  Algorithms  for  Smoothing  Spline  Estimation  of  Functions  With  or 
Without  Discontinuities 

Jyh-Jen Horng Shiau, University of Missouri-Columbia

266 On the Consistency of a Regression Function With Local Bandwidth Selection
Ting  Yang,  University  of  Cincinnati 


271 VIII. SOFTWARE TOOLS FOR STATISTICS


273  Software  for  Bayesian  Analysis:  Current  Status  and  Additional  Needs-II 
Prem K. Goel, Ohio State University

282  An  Outline  of  Arizona 

John  Alan  McDonald,  University  of  Washington 

292  An  Illustration  of  Using  MACSYMA  for  Optimal  Experimental  Design 
Kathryn  Chaloner,  University  of  Minnesota 

298 An Introduction to CART™: Classification and Regression Trees
Gerard T. LaVarnway, Norwich University

302  Generating  Code  for  Partial  Derivatives:  Some  Principles  and  Applications  to 
Statistics 

John  W.  Sawyer,  Jr.,  Texas  Tech 




307  Noise  Appreciation:  Analyzing  Residuals  Using  RS/Explore 

David A. Burn, Fanny L. O'Brien, BBN Software Products Corporation

313  An  Expert  System  for  Computer-Guided  Signal  Processing  and  Data  Analysis 
David  A.  Whitney,  Ilya  Schiller,  The  Analytic  Sciences  Corporation 


319 IX. ARTIFICIAL INTELLIGENCE, EXPERT SYSTEMS, AND STATISTICS

321 PITSSA — A Time Series Analysis System Embedded in LISP

Donald  B.  Percival,  R.  Keith  Kerr,  University  of  Washington 

331  Inside  a  Statistical  Expert  System:  Implementation  of  the  ESTES  System 
Paula  Hietala,  University  of  Tampere,  Finland 

336  The  Effect  of  Measurement  Error  in  a  Machine  Learning  System 

David  L.  Rumpf,  Mieczyslaw  M.  Kokar,  Northeastern  University,  Boston 

341  Knowledge-Based  Project  Management:  Work  Effort  Estimation 
Vijay  Kanabar,  University  of  Winnipeg 

346  Combining  Knowledge  Acquisition  and  Classical  Statistical  Techniques  in  the 
Development  of  a  Veterinary  Medical  Expert  System 

Mary  McLeish,  Matthew  Cecile,  University  of  Guelph;  Larry  Rendell, 
University  of  Illinois;  P.  Pascoe,  O.V.C.,  Guelph 

353  Methods  of  Approximate  Reasoning  in  Expert  Systems:  Computational 
Requirements 

Ambrose  Goicoechea,  George  Mason  University 

359  Algorithms  for  Paired  Comparison  Belief  Functions 

David  Tritchler,  Ontario  Cancer  Institute  and  University  of  Toronto; 
Gina  Lockwood,  Ontario  Cancer  Institute 

365 Fusion and Propagation in Graphical Belief Models
Russell  Almond,  Harvard  University 

371  Variants  of  Tierney-Kadane 

G.  Weiss,  H.A.  Howlader,  University  of  Winnipeg 


377  X.  NUMERICAL  METHODS 

379  Numerical  Approach  to  Non-Gaussian  Smoothing  and  Its  Applications 
Genshiro  Kitagawa,  Institute  of  Statistical  Mathematics 

389  Interior  Point  Methods  for  Linear  Programming  Problems 

P.T.  Boggs,  P.D.  Domich,  J.R.  Donaldson,  C.  Witzgall,  National  Bureau  of 
Standards 

398  An  Application  of  Quasi-Newton  Methods  to  Parametric  Empirical  Bayes 
Estimation 

David  Scott,  University  of  Montreal 


404  Numerical  Algorithms  for  Exact  Calculations  of  Early  Stopping  Probabilities  in 
One-Sample  Clinical  Trials  with  Censored  Exponential  Responses 

Brenda  MacGibbon,  Concordia  University  and  University  of  Quebec  at 
Montreal;  Susan  Groshen,  University  of  Southern  California;  Jean-Guy 
Levreault,  University  of  Montreal 

410  A  Numerical  Comparison  of  EM  and  Quasi-Newton  Type  Algorithms  for 
Computing  MLE's  for  a  Mixture  of  Normal  Distributions 

John  W.  Davenport,  Margaret  Anne  Pierce,  Richard  J.  Hathaway, 
Georgia  Southern  College 

416  Higher  Order  Functions  in  Numerical  Programming 
David  S.  Gladstein,  ICAD  Inc. 

420  Theory  of  Quadrature  in  Applied  Probability:  A  Fast  Algorithmic  Approach 
Allen  Don,  Long  Island  University 

426  The  Probability  Integrals  of  the  Multivariate  Normal:  The  2^  Tree  and  the 
Association  Models 

Dror  Rom,  Merck  Sharp  &  Dohme;  Sanat  K.  Sarkar,  Temple  University 


433 XI. STATISTICAL METHODS

435  Multiple-Smoothing  Parameters  in  Semiparametric  Multivariate  Model  Building 

Grace  Wahba,  University  of  Wisconsin-Madison 

442  Computing  Empirical  Likelihoods 

Art Owen, Stanford University

448  Computing  Extended  Maximum  Likelihood  Estimates  for  Linear  Parameter  Models 
Douglas B. Clarkson, IMSL, Inc.; Robert I. Jennrich, UCLA

453  Simultaneous  Confidence  Intervals  in  the  General  Linear  Model 
Jason  C.  Hsu,  Ohio  State  University 

458  Assessment  of  Prediction  Procedures  in  Multiple  Regression  Analysis 
Victor  Kipnis,  University  of  Southern  California 

464  Posterior  Influence  Plots 

Robert  E.  Weiss,  University  of  Minnesota 

470  Exact  Power  Calculations  for  the  Chi-Square  Test  of  Two  Proportions 
Carl  E.  Pierchala,  U.S.  Department  of  Agriculture 

474  On  Covariances  of  Marginally  Adjusted  Data 

James  S.  Weber,  Roosevelt  University 

479  Optimizing  Linear  Functions  of  Random  Variables  Having  a  Joint  Multinomial  or 
Multivariate  Normal  Distribution 

J.P.  De  Los  Reyes,  University  of  Akron 

485  Approaches  for  Empirical  Bayes  Confidence  Intervals  for  a  Vector  of  Exponential 
Scale  Parameters 

Bradley  P.  Carlin,  Alan  E.  Gelfand,  University  of  Connecticut 




490 A Data Analysis and Bayesian Framework for Errors-in-Variables
John  H.  Herbert,  U.S.  Department  of  Energy 

500  The  Effect  of  Low  Covariate-Criterion  Correlations  on  the 
Analysis-of-Covariance 

Michael  J.  Rovine,  Alexander  von  Eye,  Phillip  Wood,  Pennsylvania  State 
University 

505  Estimation  of  the  Variance  Matrix  for  Maximum  Likelihood  Parameters  Using 
Quasi-Newton  Methods 

Linda  Williams  Pickle,  National  Cancer  Institute;  Garth  P.  McCormick, 
George  Washington  University 

511 Application of Posterior Approximation Techniques to the Ordered Dirichlet
Distribution 

Thomas  A.  Mazzuchi,  Refik  Soyer,  George  Washington  University 

516  Comparison  of  "Local  Model"  Statistical  Classification  Methods 
Daniel  Normolle,  University  of  Michigan 

522  An  Example  of  the  Use  of  a  Bayesian  Interpretation  of  MDA  Results 
James  R.  Nolan,  Siena  College 

524  Unbiased  Estimates  of  Multivariate  General  Moment  Functions  of  the  Population 
and  Application  to  Sampling  Without  Replacement  from  a  Finite  Population 
Nabih  N.  Mikhail,  Liberty  University 


529 XII. COMPUTATIONAL DISCRETE MATHEMATICS

531  Discrete  Structures  and  Reliability  Computations 

D.E.  Whited,  Lincoln  Laboratories;  D.R.  Shier,  College  of  William  and 
Mary;  J.P.  Jarvis,  Clemson  University 

538  Determining  Properties  of  Minimal  Spanning  Trees  by  Local  Sampling 

William  F.  Eddy,  Carnegie  Mellon  University;  Allen  A.  McIntosh,  Bellcore 

546  Matrix  Completions,  Determinantal  Maximization,  and  Maximum  Entropy 

Charles  R.  Johnson,  College  of  William  and  Mary;  Wayne  W.  Barrett, 
Brigham  Young  University 

553  Algorithms  to  Reconstruct  a  Convex  Set  from  Sample  Points 

Marc Moore, École Polytechnique, Montréal and McGill University; Yves
Lemay, Bell Canada; S. Archambault, École Polytechnique, Montréal

559  Applications  of  Orthogonalization  Procedures  to  Fitting  Tree-Structured  Models 
Cynthia  O.  Siu,  Johns  Hopkins  University 

565  A  Stochastic  Extension  of  Petri  Net  Graph  Theory 
Lisa  Anneberg,  Wayne  State  University 

568  Timed  Neural  Petri  Net 

Nazih  Chamas,  Harpreet  Singh,  Wayne  State  University 




573 XIII. SIMULATION

575  Estimating  Standard  Errors:  Empirical  Behavior  of  Asymptotic  MSE-Optimal 
Batch  Sizes 

Wheyming  Tina  Song,  Bruce  Schmeiser,  Purdue  University 

581  SIMEST  and  SIMDAT:  Differences  and  Convergences 

E.  Neely  Atkinson,  Barry  W.  Brown,  James  R.  Thompson,  M.D.  Anderson 
Research  Center  and  Rice  University 

587  Acceleration  Methods  for  Monte  Carlo  Integration  in  Bayesian  Inference 
John  Geweke,  Duke  University 

593  Mixture  Experiments  and  Fractional  Factorials  Used  to  Tailor  Computer 
Simulations 

Turkan  K.  Gardenier,  TKG  Consultants,  Ltd. 

599  Simulation  and  Stochastic  Modeling  for  the  Spatial  Allocation  of 
Multi-Categorical  Resources 

Richard  S.  Segall,  University  of  Lowell 

603  A  Monte  Carlo  Assessment  of  Cross-Validation  and  the  Cp  Criterion  for  Model 
Selection  in  Multiple  Linear  Regression 

Robert  M.  Boudreau,  Virginia  Commonwealth  University 

608  It's  Time  to  Stop 

Hubert  Lilliefors,  George  Washington  University 

612  Simulating  Stationary  Gaussian  ARMA  Time  Series 
Terry  J.  Woodfield,  SAS  Institute  Inc. 

618  On  Comparative  Accuracy  of  Multivariate  Nonnormal  Random  Number 
Generators 

Lynne  K.  Edwards,  University  of  Minnesota 

624  Robustness  Study  of  Some  Random  Variate  Generators 
Lih-Yuan  Deng,  Memphis  State  University 

627  A  Ratio-of-Uniforms  Method  for  Generating  Exponential  Power  Variates 

Dean  M.  Young,  Danny  W.  Turner,  Baylor  University;  John  W.  Seaman, 
Jr.,  University  of  Southwestern  Louisiana 

630  An  Approach  for  Generation  of  Two  Variable  Sets  with  a  Specified  Correlation 
and  First  and  Second  Sample  Moments 

Mark  Eakin,  Henry  D.  Crockett,  C.S.P. 


633  XIV.  ROBUST  AND  NONPARAMETRIC  METHODS 

635  Gamma  Processes,  Paired  Comparisons  and  Ranking 
Hal  Stern,  Harvard  University 

640  A  Modular  Nonparametric  Approach  to  Model  Selection 

Michael  E.  Tarter,  Michael  D.  Lock,  University  of  California,  Berkeley 




650 Robustness of Weighted Estimators of Location: A Small-Sample Study

Gregory  Campbell,  Richard  I.  Shrager,  National  Institutes  of  Health 

656  Approximations  of  the  Wilcoxon  Rank  Sum  Test  in  Small  Samples  with  Lots  of 
Ties 

Arthur  R.  Silverberg,  U.S.  Food  and  Drug  Administration 

662  A  Comparison  of  Spearman's  Footrule  and  Rank  Correlation  Coefficient  with 
Exact  Tables  and  Approximations 

LeRoy  A.  Franklin,  Indiana  State  University 

666 The Effects of Heavy Tailed Distributions on the Two-Sided k-Sample Smirnov
Test 

Henry D. Crockett, M.M. Whiteside, University of Texas at Arlington

669  Simulated  Power  Comparisons  of  MRPP  Rank  Tests  and  Some  Standard  Score 
Tests 

Derrick  S.  Tracy,  Khushnood  A.  Khan,  University  of  Windsor 

675  Performance  of  Several  One  Sample  Procedures 

David  L.  Turner,  YuYu  Wang,  Utah  State  University 

681  XV.  TIME  SERIES  ANALYSIS 

683  Computational  Aspects  of  Harmonic  Signal  Detection 

Keh-Shin  Lii,  Tai-Houn  Tsou,  University  of  California,  Riverside 

689  Time  Series  in  a  Microcomputer  Environment 

John  D.  Henstridge,  Perth,  Western  Australia 

693  Moving  Window  Detection  for  0-1  Markov  Trials 

Joseph  Glaz,  Philip  C.  Hormel,  Bruce  McK.  Johnson,  University  of 
Connecticut  and  CIBA-GEIGY  Corporation 

699  Inference  Techniques  for  a  Class  of  Exponential  Time  Series 

V.  Chandrasekar,  Colorado  State  University;  P.J.  Brockwell,  University 
of  Melbourne,  Australia 

704  Alternative  Methods  for  Computing  the  Theoretical  Autocovariance  Function  of 
Multivariate  ARMA  Processes:  A  Comparison 
Stefan  Mittnik,  SUNY  at  Stony  Brook 


709  XVI.  RELIABILITY  AND  LIFE  DISTRIBUTIONS 

711  Increasing  Reliability  of  Multiversion  Fault-Tolerant  Software  Design  by 
Modulation 

Junryo  Miyashita,  California  State  University  at  San  Bernardino 

716  Linear  Prediction  of  Failure  Times  of  a  Repairable  System 
M.  Ahsanullah,  Rider  College 

719  The  Simulation  of  Life  Tests  with  Random  Censoring 

Joseph  C.  Hudson,  GMI  Engineering  &  Management  Institute 




725  An  Identifiable  Model  for  Informative  Censoring 

William  A.  Link,  U.S.  Fish  and  Wildlife  Service 


729 XVII. APPLICATIONS

731  Nonparametric  Regression  and  Spatial  Data:  Some  Experiences  Collaborating 
with  Biologists 

Douglas  Nychka,  North  Carolina  State  University 

737  Space  Balls!  Or  Estimating  the  Diameter  Distribution  of  Monosize  Polystyrene 
Microspheres 

Susannah  B.  Schiller,  National  Bureau  of  Standards 

743  Maximum  Queue  Size  and  Hashing  with  Lazy  Deletion 

Claire M. Mathieu, Princeton University; Jeffrey Scott Vitter, Brown
University 

749  Classifying  Linear  Mixtures,  with  an  Application  to  High  Resolution  Gas 
Chromatography 

William  S.  Rayens,  University  of  Kentucky 

755  Bias  of  Animal  Population  Trend  Estimates 

Paul  H.  Geissler,  William  A.  Link,  U.S.  Fish  and  Wildlife  Service 

760 The Elimination of Quantization Bias Using Dither

Douglas  M.  Dreher,  Martin  J.  Garbo,  Hughes  Aircraft  Company 

764  An  Alternate  Methodology  for  Subject  Database  Planning 

Henry  D.  Crockett,  Mark  E.  Eakin,  Craig  W.  Slinkman,  University  of 
Texas  at  Arlington 

769  Sensitivity  Analysis  of  the  Herfindahl-Hirschman  Index 
James  R.  Knaub,  Jr.,  U.S.  Department  of  Energy 

771  Encoding  and  Processing  of  Chinese  Language — A  Statistical  Structural  Approach 
Chaiho  C.  Wang,  U.S.  Department  of  Justice  and  George  Washington 
University 


777 XVIII. BIOSTATISTICAL METHODS

779  An  Algorithm  to  Identify  Changes  in  Hormone  Patterns 

Morton  B.  Brown,  Fred  J.  Karsch,  Benoit  Malpaux,  University  of  Michigan 

785  Optimization  in  the  Design  of  Sequential  Clinical  Trials 
Richard  Simon,  National  Cancer  Institute 

789  Bayes  Estimation  of  Cerebral  Metabolic  Rate  of  Glucose  in  Stroke  Patients 
P. David Wilson, University of South Florida; Sung Cheng Huang,
Randall A. Hawkins, UCLA School of Medicine

795  Estimation  of  Death  Density  Using  Grouped  Census  and  Vital  Statistics  Data 
John  J.  Hsieh,  University  of  Toronto 




801 Extracting Records from New Jersey's Multiple Cause of Death Files
Giles Crane, New Jersey Department of Health


805 XIX. IMAGE PROCESSING

807  A  Probabilistic  Approach  to  Range  Data  Segmentation 

Ezzet  Al-Hujazi,  Wayne  State  University;  Arun  Sood,  George  Mason 
University 

812  Compression  of  Image  Data  Using  Arithmetic  Coding 

Ahmed  Desoky,  Carol  O'Connor,  Thomas  Klein,  University  of  Louisville 

816  Image  Analysis  of  the  Microvascular  System  in  the  Rat  Cremaster  Muscle 

Carol  O'Connor,  Ahmed  Desoky,  Cathy  Senft,  Patrick  Harris,  University 
of  Louisville 

822  An  Empirical  Bayes  Decision  Rule  of  Two-Class  Pattern  Recognition 
Tze  Fen  Li,  Dinesh  S.  Bhoj,  Rutgers  University 

824 Statistical Modeling of a Priori Information for Image Processing Problems:
A Mathematical Expression of Images

Z.  Liang,  Duke  University  Medical  Center 


833  Appendix  A:  List  of  Paid  Registrants 
859  Appendix  B:  Author  Index 




Symposium  Chairman 


Edward  J.  Wegman 
Center  for  Computational  Statistics 
George  Mason  University 
Fairfax,  VA  22030 
(703)  323-2723 

EMAIL:  EWEGMAN@GMUVAX  (bitnet)  or 
EWEGMAN@GMUVAX.GMU.EDU  (arpanet) 


Symposium  Coordinator  and  Exhibit  Manager 


Jan  P.  Guenther 

Center  for  Computational  Statistics 
George  Mason  University 
Fairfax,  VA  22030 
(703)  764-6170 


Program  Committee 


David  Allen 

University  of  Kentucky 

John  Miller 

George  Mason  University 

Chris  Brown 

University  of  Rochester 

Mervin  Muller 

Ohio  State  University 

Martin  Fischer 

Defense  Communication  Engineering  Center 

Stephen  Nash 

George  Mason  University 

Donald  T.  Gantz 

George  Mason  University 

Emanuel  Parzen 

Texas  A  and  M  University 

Prem  K.  Goel 

Ohio  State  University 

Richard  Ringeisen 

Clemson  University 

Muhammed  Habib 

University  of  North  Carolina 

Jerry  Sacks 

University  of  Illinois 

Mark  E.  Johnson 

Los  Alamos  National  Laboratory 

David  Scott 

Rice  University 

Sallie  Keller-McNulty 

Kansas  State  University 

Nozer  Singpurwalla 

George  Washington  University 

Raoul  LePage 

Michigan  State  University 

Werner  Stuetzle 

University  of  Washington 

Don  McClure 

Brown  University 

Paul  Tukey 

Bell  Communications  Research 



Past  Interface  Symposia 


Southern California, 1967, 1968
1st and 2nd Symposia
Chair: Nancy Mann

Southern California, 1969
3rd Symposium
Chair: Ed Robinson

Southern California, 1971
4th Symposium
Chair: Mike Tarter
Keynote Speakers: Richard Hamming and Frank Anscombe

Oklahoma State University, 1972
5th Symposium
Chair: Mitchell O. Locks
Keynote Speaker: H. O. Hartley

University of California, Berkeley, 1973
6th Symposium
Chair: Michael Tarter
Keynote Speaker: John Tukey

Iowa State University, 1974
7th Symposium
Chair: William J. Kennedy
Keynote Speaker: Martin Wilk

University of California, Los Angeles, 1975
8th Symposium
Chair: James W. Frane
Keynote Speaker: Edwin Kuh

Harvard University, 1976
9th Symposium
Chairs: David Hoaglin and Roy E. Welsch
Keynote Speaker: John R. Rice

National Bureau of Standards, 1977
10th Symposium
Chair: David Hogben
Keynote Speaker: Anthony Ralston

North Carolina State University, 1978
11th Symposium
Chairs: Ron Gallant and Thomas Gerig
Keynote Speaker: Nancy Mann

University of Waterloo, 1979
12th Symposium
Chair: Jane F. Gentleman
Keynote Speaker: D. R. Cox

Carnegie-Mellon University, 1981
13th Symposium
Chair: William F. Eddy
Keynote Speaker: Brad Efron

Rensselaer Polytechnic Institute, 1982
14th Symposium
Chairs: John W. Wilkinson, Karl W. Heiner and Richard Sacher
Keynote Speaker: John Tukey

IMSL, Inc. (held in Houston), 1983
15th Symposium
Chair: James Gentle
Keynote Speaker: Richard Hamming

University of Georgia (held in Atlanta), 1984
16th Symposium
Chair: Lynne Billard
Keynote Speaker: George Marsaglia

University of Kentucky, 1985
17th Symposium
Chair: David Allen
Keynote Speaker: John C. Nash




Past  Interface  Symposia  (Continued) 


Colorado State University, 1986
18th Symposium
Chair: Thomas Boardman
Keynote Speaker: John Tukey

Temple University (held in Philadelphia), 1987
19th Symposium
Chair: Richard Heiberger
Keynote Speaker: Gene Golub

George Mason University, 1988
20th Symposium
Chair: Edward J. Wegman
Keynote Speaker: Brad Efron


Future Interface Symposia


University of South Florida, 1989
21st Symposium
Chairs: Ken Berk and Linda Malone

Michigan State University, 1990
22nd Symposium
Chair: Raoul LePage


General  Information 


The 20th Symposium represents a milestone in the development of the interface between
computing  science  and  statistics.  In  August,  1987  the  Interface  Foundation  of  North  America  was 
incorporated  as  a  non-profit,  educational  corporation  whose  main  charter  is  to  provide  the  legal  entity 
underpinning  the  Symposium  series.  The  Foundation  represents  a  maturation  of  the  Symposium  series 
and  ensures  its  continuation  as  an  independent  meeting  focused  on  the  interface.  The  20th  Symposium 
is  the  first  held  under  the  auspices  of  the  Foundation. 

Theme:  —  Computationally  Intensive  Statistical  Methods 

Keynote  Address:  —  “Computationally  intensive  statistical  inference” 

Bradley  Efron,  Department  of  Statistics,  Stanford  University 

Invited  Papers:  —  There  are  60  invited  papers  including  several  with  invited  discussion  organized  into 
23  sessions.  In  addition  to  the  plenary  session  with  the  keynote  address  by  Brad  Efron,  there  are  three 
special  invited  lectures  featuring  Jerome  Friedman,  George  E.  P.  Box  and  Thomas  Banchoff. 

Contributed  Papers:  —  There  are  128  contributed  papers  scheduled  in  26  sessions. 

Exhibitors 


Ametek  Computer  Corporation 
606  East  Huntington  Drive 
Monrovia,  CA  91016 
(714)  599-4662 

Automatic Forecasting Systems, Inc.
P. O. Box 563
Hatboro,  PA  19040 
(215)  675-0652 


North  Holland/Elsevier  Publishers 
P.  O.  Box  1991 
1000  BZ  Amsterdam 
The  Netherlands 

Numerical  Algorithms  Group 
1101  31st  Street,  Suite  100 
Downers  Grove,  IL  60515 
(312)  971-2337 




BBN Software
10 Fawcett Street
Cambridge, MA 02238
(617) 873-8116

Springer-Verlag, Inc.
175 Fifth Avenue
New York, NY 10010
(212) 460-1600

BMDP Statistical Software, Inc.
1440 Sepulveda Boulevard, Suite 316
Los Angeles, CA 90025
(213) 479-7799

SYSTAT, Inc.
1800 Sherman Avenue
Evanston, IL 60201
(312) 864-5670

Intel Scientific Computers
15201 NW Greenbrier Parkway
Beaverton, OR 97006
(503) 629-7631

TCI Software
1190 Foster Road
Las Cruces, NM 88001
(505) 522-4600

Marcel-Dekker, Inc.
270 Madison Avenue
New York, NY 10016
(212) 696-9000

Tektronix, Inc.
M.S. 48-300, Industrial Park
Beaverton, OR 97077
(503) 627-7111

IMSL, Inc.
2500 ParkWest Tower One
2500 CityWest Boulevard
Houston, TX 77042-3020
(713) 782-6060

Wadsworth & Brooks/Cole
Advanced Books and Software
10 Davis Drive
Belmont, CA 94002
(415) 595-2350

Short  Course 

Forecasting on the IBM-PC - A Survey, Wednesday, April 20, 9:00 a.m. to 4:30 p.m., David P. Reilly,
Automatic Forecasting Systems, Inc., P. O. Box 563, Hatboro, PA 19040, (215) 675-0652

Cooperating Societies

American Mathematical Society
P. O. Box 6248
Providence, RI 02940

American Statistical Association
1429 Duke Street
Alexandria, VA 22314

International Association for Statistical Computing
NTDH
P. O. Box 145
N-7701 Steinkjer
Norway

Institute of Mathematical Statistics
3401 Investment Boulevard, Suite 7
Hayward, CA 94545

National Computer Graphics Association
2722 Merilee, Suite 200
Fairfax, VA 22031

Operations Research Society of America
Mount Royal and Guilford Avenues
Baltimore, MD 21202

Society for Industrial and Applied Mathematics
1400 Architects Building
117 South 17th Street
Philadelphia, PA 19103

Virginia Academy of Science Chapter of ASA
c/o Golde I. Holtzman
Department of Statistics
Virginia Tech
Blacksburg, VA 24061

Washington Statistical Society
P. O. Box 70843
Washington, DC 20024-0843



Program  Schedule 


Date and Time  Session Title

Thursday, April 21

8:45 a.m. - 9:45 a.m.  Keynote Address: Computationally Intensive Statistical Inference

10:00 a.m. - 12:00 noon  Computational Aspects of Time Series Analysis
Inference and Artificial Intelligence
Computational Discrete Mathematics
Contributed: Software Tools
Contributed: Image Processing I
Contributed: Bootstrapping and Related Computational Methods

1:30 p.m. - 3:30 p.m.  Special Invited Lecture I
Image Processing and Spatial Processes
Parallel Computing Architectures
Contributed: Statistical Methods I
Contributed: Hardware and Software Reliability
Contributed: Applications I

3:45 p.m. - 5:45 p.m.  Special Invited Session for Recent Ph.D.’s
Simulation
Symbolic Computation and Statistics
Contributed: Statistical Graphics
Contributed: Models of Imprecision in Expert Systems
Contributed: Time Series Methods

Friday, April 22

8:00 a.m. - 10:00 a.m.  Computer-Communication Networks
Supercomputing, Design of Experiments and Bayesian Analysis, Part 1
Numerical Methods in Statistics
Contributed: Probability and Stochastic Processes
Contributed: Statistical Methods II
Contributed: Nonparametric and Robust Techniques

10:15 a.m. - 12:15 p.m.  Special Invited Lecture II
Supercomputing, Design of Experiments and Bayesian Analysis, Part 2
Neural Networks
Contributed: Applications II
Contributed: Image Processing II
Contributed: Simulation I

2:00 p.m. - 4:00 p.m.  Tales of the Unexpected: Successful Interdisciplinary Research
Density Estimation and Smoothing
Object Oriented Programming
Contributed: Numerical Methods
Contributed: Bayesian Methods
Contributed: Expert Systems in Statistics

Saturday,  April  23 

8:30  a.m.  -  10:30  a.m.  Computational  Aspects  of  Simulated  Annealing 

Dynamical  High  Interaction  Graphics 
Contributed:  Statistical  Methods  III 
Contributed:  Simulation  II 
Contributed:  Biostatistics  Applications 
Contributed:  Discrete  Mathematical  Methods 

10:45  a.m.  -  12:45  p.m.  Special  Invited  Lecture  III 

Entropy  Methods 

Contributed:  Information  Systems,  Databases  and  Statistics 
Contributed:  Parallel  Computing 
Contributed:  Density  and  Function  Estimation 
Contributed:  Statistical  Methods  IV 


Technical  Program 
WEDNESDAY,  APRIL  20,  1988 

9:00  a.m.  -  4:30  p.m. 

Short  Course  -  Forecasting  on  the  IBM-PC,  David  Reilly,  Automatic  Forecasting  Systems, 
Inc. 

THURSDAY,  APRIL  21,  1988 

8:45  a.m.  -  9:45  a.m. 

Plenary  Session,  Chaired  by:  Edward  J.  Wegman,  George  Mason  University 
“Computationally intensive statistical inference,” Bradley Efron, Stanford University
10:00  a.m.  -  12:00  noon 

Computational  Aspects  of  Time  Series  Analysis,  Chaired  by:  Emanuel  Parzen, 

Texas  A  &  M  University 

“Recent  progress  in  algorithms  and  architectures  for  time  series  analysis,”  George  Cybenko, 
Tufts  University 

“Numerical approach to non-gaussian smoothing and its application,” Genshiro Kitagawa,
The  Institute  of  Statistical  Mathematics 

Discussants - Will Gersch, University of Hawaii and H. Joseph Newton, Texas A & M
University 




10:00 a.m. - 12:00 noon

Inference  and  Artificial  Intelligence,  Chaired  by:  N.  Singpurwalla,  George  Washington 
University 

“Spectral  Analysis  on  a  LISP  machine,”  Don  Percival,  University  of  Washington 

“DeFinetti’s  approach  to  group  decision  making,”  Richard  Barlow,  University  of  California, 
Berkeley 

“Meta-analysis,”  Ingram  Olkin,  Stanford  University 
10:00  a.m.  -  12:00  noon 

Computational  Discrete  Mathematics,  Chaired  by:  Rich  Ringeisen,  Clemson  University 

“Discrete  structures  and  reliability  computations,”  James  P.  Jarvis,  Clemson  University 
and  Douglas  R.  Shier,  College  of  William  and  Mary 

“Random  graphs,”  Edward  R.  Scheinerman,  The  Johns  Hopkins  University 
“Structure  and  finiteness  conditions  on  graphs,”  Neil  Robertson,  Ohio  State  University 
10:00  a.m.  -  12:00  noon 

Contributed  Papers:  Software  Tools,  Chaired  by:  Leonard  Hearne,  George  Mason 
University 

“An introduction to CART™: classification and regression trees,” Gerard T. LaVarnway,
Norwich  University 

“Noise  appreciation:  analyzing  residuals  using  RS/Explore,”  David  A.  Burn  and  Fanny 
O’Brien, BBN Software Products Corporation

“COSTAR:  an  environment  for  computer-guided  data  analysis,”  David  A.  Whitney  and 
Ilya  Schiller,  TASC 

“A  closer  look  at  symbolic  computation,”  William  M.  Makuch,  General  Electric  Corporation 
and  John  W.  Wilkinson,  Rensselaer  Polytechnic  Institute 

10:00  a.m.  -  12:00  noon 

Contributed  Papers:  Image  Processing  I,  Chaired  by:  A.  K.  Sood,  George  Mason  University 

“Image  analysis  of  a  turbulent  object  using  fractal  parameters,”  Amar  Ait-Kheddache, 

North  Carolina  State  University 

“Identification  of  closed  figures,”  Jeff  Banfield,  Montana  State  University  and  Adrian 
Raftery,  University  of  Washington 

“Compression of image data using arithmetic coding,” Ahmed H. Desoky and Thomas
Klein,  University  of  Louisville 

“Image analysis of the microvascular system in the rat cremaster muscle,” C. O’Connor,

P.  D.  Harris,  A.  Desoky  and  G.  Ighodaro,  University  of  Louisville 

“Automatic  detection  of  the  optic  nerve  in  color  images  of  the  retina,”  Norman  Katz, 
Subhasis  Chaudhuri,  and  Michael  Goldbaum,  University  of  California,  San  Diego  and 
Mark  Nelson,  Radford  Company 

10:00  a.m.  -  12:00  noon 

Contributed  Papers:  Bootstrapping  and  Related  Computational  Methods,  Chaired  by: 
Richard  Bolstein,  George  Mason  University 

“A  Monte  Carlo  study  of  cross-validation  and  the  Cp  criterion  for  model  selection  in 
multiple  linear  regression,”  Robert  M.  Boudreau,  Virginia  Commonwealth  University 

“Bootstrapping  regression  strategies,”  David  Brownstone,  University  of  California,  Irvine 

“Bootstrapping the mixed regression model with reference to the capital and energy
complementarity debate,” Baldev Raj, Wilfrid Laurier University

“Efficient  data  sensitivity  computation  for  maximum  likelihood  estimation,”  Daniel  Chin 
and  James  C.  Spall,  The  Johns  Hopkins  University 

“Bootstrap procedures in random effect models for comparing response rates in multi-center
clinical trials,” Michael F. Miller, Hoechst-Roussel Pharmaceuticals, Inc.

1:30  p.m.  -  2:45  p.m. 

Special  Invited  Lecture  I,  Chaired  by:  Jim  Filliben,  National  Bureau  of  Standards 

“Fitting  functions  to  scattered  noisy  data  in  high  dimensions,”  Jerome  Friedman, 

Stanford  University 

1:30  p.m.  -  3:30  p.m. 

Image  Processing  and  Spatial  Processes,  Chaired  by:  Don  McClure,  Brown  University 
Introduction,  Don  McClure,  Brown  University 

“A  multilevel-multiresolution  technique  for  image  analysis  and  robot  vision  via 
renormalization  group  ideas,”  Basilis  Gidas,  Brown  University 

“A  mathematical  approach  to  expert  system  construction,”  Alan  Lippman,  Brown 
University 




1:30  p.m.  -  3:30  p.m. 

Parallel  Computing  Architectures,  Chaired  by:  Chris  Brown,  University  of  Rochester 

“Experiences with the BBN Butterfly™ parallel processor,” John Mellor-Crummey,

University  of  Rochester 

“Statistical computing on a hypercube,” George Ostrouchov, Oak Ridge National Lab
“Asynchronous iteration,” William F. Eddy and Mark Schervish, Carnegie-Mellon University
1:30  p.m.  -  3:30  p.m. 

Contributed Papers: Statistical Methods I, Chaired by: Walter Liggett, National Bureau of
Standards 

“An  example  of  the  use  of  a  Bayesian  interpretation  of  multiple  discriminant  analysis 
results,”  James  R.  Nolan,  Siena  College 

“Real-time  classification  and  discrimination  among  components  of  a  mixture  distribution,” 
Douglas  A.  Samuelson,  International  Telesystems  Corporation 

“Comparison  of  three  ‘local  model’  classification  methods,”  Daniel  Normolle,  University  of 
Michigan 

“Application of posterior approximation techniques for the ordered Dirichlet distribution,”
Thomas  A.  Mazzuchi  and  Refik  Soyer,  George  Washington  University 

“Unbiased  estimates  of  multivariate  general  moment  functions  of  the  population  and 
application  to  sampling  without  replacement  for  a  finite  population,”  Nabih  N.  Mikhail, 
Liberty  University 


1:30  p.m.  -  3:30  p.m. 

Contributed  Papers:  Hardware  and  Software  Reliability,  Chaired  by:  Asit  Basu,  University 
of  Missouri 

“Linear  prediction  of  failure  times  of  a  repairable  system,”  M.  Ahsanullah,  Rider  College 

“The  simulation  of  life  tests  with  random  censoring,”  Joseph  C.  Hudson,  GMI  Engineering 
and  Management  Institute 

“The  use  of  general  modified  exponential  curves  in  software  reliability  modeling,” 

Taghi  M.  Khoshgoftaar,  Florida  Atlantic  University 

“A  model  for  information  censoring,”  William  A.  Link,  Patuxent  Wildlife  Research  Center 

“Increasing  reliability  of  multiversion  fault-tolerant  software  design  by  modulation,”  Junryo 
Miyashita,  California  State  University,  San  Bernardino 

1:30 p.m. - 3:30 p.m.

Contributed Papers: Applications I, Chaired by: Susannah Schiller, National Bureau of
Standards 

“Classifying  linear  mixtures  with  an  application  to  high  resolution  gas  chromatography,” 
William  S.  Rayens,  University  of  Kentucky 

“Bias  of  animal  trend  estimates,”  Paul  H.  Geissler  and  William  A.  Link,  Patuxent  Wildlife 
Research  Center 

“A  non-random  walk  through  futures  prices  of  the  British  pound,”  William  S.  Mallios, 
California  State  University,  Fresno 

“A  stochastic  extension  of  Petri  net  graph  theory,”  L.  M.  Anneberg,  Wayne  State  University 
“Neural  Petri  nets,”  N.  H.  Chamas,  Wayne  State  University 
3:45  p.m.  -  5:45  p.m. 

Special Invited Session for Recent Ph.D.’s, Chaired by: John J. Miller, George Mason
University 

“Additive  principal  components:  a  method  for  estimating  equations  with  small  variance 
from  multivariate  data,”  Deborah  Donnell,  Bellcore 

“Gamma  processes,  paired  comparisons  and  ranking,”  Hal  Stern,  Harvard  University 

“Smoothing  data  with  correlated  errors,”  Naomi  Altman,  Cornell  University 

“The data viewer: program for graphical data analysis,” Catherine Hurley, University of
Waterloo 


3:45  p.m.  -  5:45  p.m. 

Simulation,  Chaired  by:  Donald  T.  Gantz,  George  Mason  University 

“Random  variables  for  supercomputers,”  George  Marsaglia,  Florida  State  University 

“Computational  statistics  in  experimental  design  for  studies  of  variability,”  John  Ramberg, 
University  of  Arizona 

“Linear  combinations  of  estimators  of  the  variance  of  the  sample  mean,”  Bruce  W. 
Schmeiser,  Purdue  University 

3:45 p.m. - 5:45 p.m.

Symbolic  Computation  and  Statistics,  Chaired  by:  William  S.  Rayens,  University  of 
Kentucky 

“Some  applications  of  symbol  manipulation  in  statistical  analysis,”  Kathryn  M.  Chaloner, 
University  of  Minnesota 

“Symbolic  computation  in  statistical  decision  theory,”  Marietta  Tretter,  Texas  A  &  M 
University 

“Partial  differentiation  by  computer  with  applications  to  statistics,”  John  W.  Sawyer,  Jr., 
Texas  Tech  University 


3:45  p.m.  -  5:45  p.m. 

Contributed  Papers:  Statistical  Graphics,  Chaired  by:  Robert  Launer,  Army  Research 
Office 

“Visual  multidimensional  geometry  with  applications,”  Alfred  Inselberg,  IBM  Scientific 
Center,  Los  Angeles  and  Bernard  Dimsdale,  University  of  California 

“Some  graphical  representations  of  multivariate  data,”  Masood  Bolorforoush  and 
Edward  J.  Wegman,  George  Mason  University 

“Graphical  representations  of  main  effects  and  interaction  effects  in  a  polynomial  regression 
on  several  predictors,”  William  DuMouchel,  BBN  Software  Products  Corporation 

“Chernoff faces: a PC implementation,” Mohammad Dadashzadeh, University of Detroit

3:45  p.m.  -  5:45  p.m. 

Contributed  Papers:  Models  of  Imprecision  in  Expert  Systems,  Chaired  by: 

Mark  Youngren,  George  Washington  University 

“Fusion and propagation of graphical belief models,” Russell Almond, Harvard University

“Belief  function  computations  for  paired  comparisons,”  David  Tritchler  and  Gina  Lockwood, 
Ontario  Cancer  Institute 

“Variants of Tierney-Kadane,” Guenter Weiss and H. A. Howlader, University of Winnipeg

“Dynamically updating relevance judgements in probabilistic information systems via users’
feedback,”  Peter  Lenk  and  Barry  D.  Floyd,  New  York  University 

“Computational  requirements  for  inference  methods  in  expert  systems:  a  comparative 
study,”  Ambrose  Goicoechea,  George  Mason  University 

3:45 p.m. - 5:45 p.m.

Contributed  Papers:  Time  Series  Methods,  Chaired  by:  Neil  Gerr,  Office  of  Naval 
Research 

“Inference  techniques  for  a  class  of  exponential  time  series,”  V.  Chandrasekar  and 
Peter  Brockwell,  Colorado  State  University 

“Some  recursive  methods  in  time  series  analysis,”  Q.  P.  Duong,  Bell  Canada 

“Time  series  in  a  microcomputer  environment,”  John  Henstridge,  Numerical  Algorithms 
Group 

“Smoothing  irregular  time  series,”  Keith  W.  Hipel,  University  of  Waterloo,  A.  I.  McLeod, 
The  University  of  Western  Ontario  and  Byron  Bodo,  Ministry  of  the  Environment 

“Computation of the theoretical autocovariance function of multivariate ARMA processes,”
Stefan  Mittnik,  SUNY  at  Stony  Brook 


FRIDAY,  APRIL  22,  1988 


8:00  a.m.  -  10:00  a.m. 

Computer-Communication  Networks,  Chaired  by:  Martin  Fischer,  Defense  Communication 
Engineering  Center 

“Introduction  to  packet  switching  networks,”  Jeffrey  Mayersohn,  BBN  Communication 
Corporation 

“Electronic  mail  -  a  valuable  augmentation  tool  for  scientists,”  Elizabeth  Feinler, 

SRI  International 

“Networks  to  support  science,”  Stephen  Wolff,  National  Science  Foundation 

8:00 a.m. - 10:00 a.m.

Supercomputing,  Design  of  Experiments  and  Bayesian  Analysis,  Part  I,  Chaired  by: 

Jerry  Sacks,  University  of  Illinois 

“Acceleration  methods  for  Monte  Carlo  integration  by  Bayesian  inference,”  John  Geweke, 
Duke  University 

“Software  for  Bayesian  analysis:  current  status  and  additional  needs,”  Prem  K.  Goel, 

Ohio  State  University 

“Some numerical and graphical strategies for implementing Bayesian methods,”

Adrian  Smith,  University  of  Nottingham 

8:00  a.m.  -  10:00  a.m. 

Numerical  Methods  for  Statistics,  Chaired  by:  Stephen  Nash,  George  Mason  University 

“Interior point methods for linear programming,” Paul Boggs, National Bureau of Standards

“Block iterative methods for parallel optimization,” Stephen Nash and Ariela Sofer, George
Mason  University 

“New  methods  for  B-differentiable  functions:  theory  and  applications,”  Jong-Shi  Pang, 

The  Johns  Hopkins  University 

8:00  a.m.  -  10:00  a.m. 

Contributed  Papers:  Probability  and  Stochastic  Processes,  Chaired  by:  Yash  Mittal, 
National  Science  Foundation 

“Moving  window  detection  for  0-1  Markov  trials,”  Joseph  Glaz,  University  of  Connecticut, 
Philip  C.  Hormel,  CIBA-GEIGY  Corporation  and  Bruce  McK.  Johnson,  University  of 
Connecticut 

“Maximum  queue  size  and  hashing  with  lazy  deletion,”  Claire  M.  Mathieu,  Laboratoire 
d’Informatique de l’École Normale Supérieure and Jeffrey S. Vitter, Brown University

“On  the  probability  integrals  of  the  multivariate  normal,”  Dror  Rom  and  Sanat  Sarkar, 
Temple  University 

“Computational  aspects  of  harmonic  signal  detection,”  Keh-Shin  Lii  and  Tai-Houn  Tsou, 
University  of  California,  Riverside 

“Maximum likelihood estimation of discrete control processes: theory and application,”
John  Rust,  University  of  Wisconsin 

8:00 a.m. - 10:00 a.m.

Contributed Papers: Statistical Methods II, Chaired by: Cliff Sutton, George Mason
University 

“Computing  extended  maximum  likelihood  estimates  in  generalized  linear  models,” 

Douglas  B.  Clarkson,  IMSL,  Inc.  and  Robert  I.  Jennrich,  University  of  California,  Los 
Angeles 

“Assessment  of  prediction  procedures  in  multiple  regression  analysis,”  Victor  Kipnis, 
University  of  Southern  Florida 

“Estimation of the variance matrix for maximum likelihood parameters by quasi-Newton
methods,”  Linda  Pickle,  National  Cancer  Institute  and  Garth  P.  McCormick,  George 
Washington  University 

“Variable selection in multivariate multiresponse permutation procedures,” Eric P. Smith,
Virginia  Tech 

“The  effect  of  small  covariate-criterion  correlations  on  analysis  of  covariance,” 

Michael  J.  Rovine,  A.  von  Eye  and  P.  Wood,  Pennsylvania  State  University 

8:00  a.m.  -  10:00  a.m. 

Contributed  Papers:  Nonparametric  and  Robust  Techniques,  Chaired  by:  Paul  Speckman, 
University  of  Missouri 

“Robustness  of  weighted  estimators  of  location:  a  small  sample  survey,”  Greg  Campbell 
and  Richard  I.  Shrager,  NIH 

“A  comparison  of  Spearman’s  footrule  and  rank  correlation  coefficient  with  exact  tables  and 
approximations,”  LeRoy  A.  Franklin,  Indiana  State  University 

“Approximations  of  the  Wilcoxon  test  in  small  samples  with  lots  of  ties,” 

Arthur  R.  Silverberg,  Food  and  Drug  Administration 

“Simulated  power  comparisons  of  MRPP  rank  tests  and  some  standard  score  tests,” 

Derrick  S.  Tracy  and  Khushnood  A.  Khan,  University  of  Windsor 

10:15  a.m.  -  12:15  p.m. 

Special  Invited  Lecture  II,  Chaired  by:  Mervin  Muller,  Ohio  State  University 

“Some  modern  quality  improvement  techniques  and  their  computing  implications,” 

George  E.  P.  Box,  University  of  Wisconsin 

Special  invited  discussion,  Gerald  J.  Hahn,  GE  CRD  and  Gregory  B.  Hudak,  Scientific 
Computing  Associates 




10:15 a.m. - 12:15 p.m.

Supercomputing,  Design  of  Experiments  and  Bayesian  Analysis,  Part  II,  Chaired  by: 

Prem K. Goel, Ohio State University

“Supercomputer-aided  design,”  Jerry  Sacks,  University  of  Illinois 

“A  Bayesian  approach  to  the  design  and  analysis  of  computer  experiments,”  Toby  Mitchell, 
Oak  Ridge  National  Lab 

10:15  a.m.  -  12:15  p.m. 

Neural  Networks,  Chaired  by:  Muhammed  Habib,  University  of  North  Carolina 

“Statistical  learning  networks:  a  unifying  view,”  Andrew  R.  Barron,  University  of  Illinois 
and  Roger  L.  Barron,  Barron  Associates,  Inc. 

“Stochastic  models  of  neuronal  behavior,”  Gopinath  Kallianpur,  University  of  North 
Carolina 

“Inference  for  stochastic  models  for  neural  networks,”  Muhammed  Habib,  University  of 
North  Carolina  and  A.  Thavaneswaran,  Temple  University 

10:15  a.m.  -  12:15  p.m. 

Contributed  Papers:  Applications  II,  Chaired  by:  Brian  Woodruff,  Air  Force  Office  of 
Scientific  Research 

“Space  Balls!  or  estimating  diameter  distributions  of  polystyrene  microspheres,” 

Susannah  Schiller  and  Charles  Hagwood,  National  Bureau  of  Standards 

“Comparing  sample  reuse  methods  at  FHA  -  an  empirical  approach,”  Thomas  N.  Herzog, 

U.  S.  Department  of  Housing  and  Urban  Development 

“Maximum  entropy  and  its  application  to  linguistic  diversity,”  R.  K.  Jain,  Memorial 
University  of  Newfoundland 

“Encoding  and  processing  of  Chinese  language  -  a  statistical  structural  approach,” 

Chaiho  C.  Wang,  George  Washington  University 

“The  elimination  of  quantization  bias  using  dither,”  Martin  J.  Garbo  and 
Douglas  M.  Dreher,  Hughes  Aircraft  Company 

10:15  a.m.  -  12:15  p.m. 

Contributed  Papers:  Image  Processing  II,  Chaired  by:  Refik  Soyer,  George  Washington 
University 

“Maximum  entropy  and  the  nearly  black  image,”  Iain  Johnstone,  Stanford  University  and 
David  Donoho,  University  of  California,  Berkeley 

“A  probabilistic  approach  to  range  image  description,”  Arun  Sood,  George  Mason  University 
and  E.  Al-Hujazi,  Wayne  State  University 

“An empirical Bayes decision rule of two-class pattern recognition for one-dimensional
parametric  distributions,”  Tze  Fen  Li,  Rutgers  University 

“Statistical modeling of a priori information for image processing problems,” Z. Liang, Duke
University  Medical  Center 

“Advanced statistical computations improve image processing applications,” Bobby Saffari,
Generex  Corporation 

10:15  a.m.  -  12:15  p.m. 

Contributed  Papers:  Simulation  I,  Chaired  by:  Bill  DuMouchel,  BBN 

“On comparative accuracy of multivariate nonnormal random number generators,”

Lynne  K.  Edwards,  University  of  Minnesota 

“Bayesian  analysis  using  Monte  Carlo  integration:  an  effective  methodology  for  handling 
some  difficult  problems  in  statistical  analysis,”  Leland  Stewart,  Lockheed  Research 
Laboratory 

“A  squeeze  method  for  generating  exponential  power  variates,”  Dean  M.  Young,  Baylor 
University 

“Mixture  experiments  and  fractional  factorials  used  to  tailor  large-scale  computer 
simulation,”  T.K.  Gardenier,  TKG  Consultants,  Ltd. 

“Simulating stationary Gaussian ARMA time series,” Terry J. Woodfield, SAS Institute,
Inc. 


2:00  p.m.  -  4:00  p.m. 

Tales  of  the  Unexpected:  Successful  Interdisciplinary  Research,  Chaired  by:  Sallie  McNulty, 
Kansas  State  University 

“Some  statistical  problems  in  meteorology,”  Grace  Wahba,  University  of  Wisconsin 

“Modeling  parallelism,  an  interdisciplinary  approach,”  Elizabeth  Unger,  Kansas  State 
University 

“Mice,  rain  forests  and  finches:  experiences  collaborating  with  biologists,”  Douglas  Nychka, 
North  Carolina  State  University 

Discussion:  Jerome  Sacks,  University  of  Illinois 

2:00 p.m. - 4:00 p.m.

Density  Estimation  and  Smoothing,  Chaired  by:  David  Scott,  Rice  University 

“XploRe:  computing  environment  for  exploratory  regression  and  density  estimation 
methods,”  Wolfgang  Hardle,  University  of  Bonn 

“Curve  estimation  with  applications  to  mapping  and  risk  decomposition,”  Michael  Tarter, 
University  of  California,  Berkeley 

“Interactive  multivariate  density  estimation  in  the  S  package,”  David  Scott,  Rice 
University 

2:00  p.m.  -  4:00  p.m. 

Object  Oriented  Programming,  Chaired  by:  Werner  Stuetzle,  University  of  Washington 

“Object  oriented  programming:  a  tutorial,”  Wayne  Oldford,  University  of  Waterloo 

“An  object  oriented  toolkit  for  plotting  and  interface  construction,”  Robert  Young, 
Schlumberger, Palo Alto Research Center

“An  outline  of  Arizona,”  John  MacDonald,  University  of  Washington 


2:00  p.m.  -  4:00  p.m. 

Contributed Papers: Numerical Methods, Chaired by: Ariela Sofer, George Mason
University 

“A theory of quadrature in applied probability: a fast algorithmic approach,” Allen Don,
Long  Island  University 

“Higher  order  functions  in  numerical  programming,”  David  Gladstein,  ICAD 

“A numerical comparison of EM and quasi-Newton type algorithms for finding MLE’s for a
mixture  of  normal  distributions,”  Richard  J.  Hathaway,  John  W.  Davenport  and  Margaret 
Anne  Pierce,  Georgia  Southern  College 

“Numerical  algorithms  for  exact  calculations  of  early  stopping  probabilities  in  one-sample 
clinical trials with censored exponential responses,” Brenda MacGibbon, Concordia
University,  Susan  Groshen,  University  of  Southern  California  and  Jean-Guy  Levreault, 
University  of  Montreal 

“An  application  of  quasi-Newton  methods  in  parametric  empirical  Bayes  calculations,” 
David  Scott,  Concordia  University 

2:00 p.m. - 4:00 p.m.

Contributed Papers: Bayesian Methods, Chaired by: William F. Eddy, Carnegie-Mellon
University 

“Approaches  for  empirical  Bayes  confidence  intervals  with  application  to  exponential  scale 
parameters,”  Alan  E.  Gelfand  and  Bradley  P.  Carlin,  University  of  Connecticut 

“A  data  analysis  and  Bayesian  framework  for  errors-in-variables,”  John  H.  Herbert, 
Department  of  Energy 

“Bayesian  diagnostics  for  almost  any  model,”  Robert  E.  Weiss,  University  of  Minnesota 

“An  iterative  Bayes  method  for  classifying  multivariate  observations,”  Duane  E.  Wolting, 
Aerojet  Tech  Systems  Company 

“A Bayesian model of information combination from noisy sensors,” G. Anandalingam,
University  of  Pennsylvania 


2:00  p.m.  -  4:00  p.m. 

Contributed Papers: Expert Systems in Statistics, Chaired by: Khalid Abouri, George
Washington  University 

“Inside  a  statistical  expert  system:  implementation  of  the  ESTES  expert  system,” 

Paula  Hietala,  University  of  Tampere,  Finland 

“Knowledge-based  project  management:  work  effort  estimation,”  Vijay  Kanabar, 
University  of  Winnipeg 

“Combining  knowledge  acquisition  and  classical  statistical  techniques  in  the  development  of 
a  veterinary  medical  expert  system,”  Mary  McLeish,  University  of  Guelph 

“The  effect  of  measurement  error  in  a  machine  learning  system,”  David  L.  Rumpf  and 
Mieczyslaw M. Kokar, Northeastern University

“An  expert  system  for  prescribing  statistical  tests  of  non-parametric  and  simple  parametric 
designs,”  Gary  Tubb,  University  of  South  Florida 


SATURDAY, APRIL 23, 1988


8:30 a.m. - 10:30 a.m.

Computational  Aspects  of  Simulated  Annealing,  Chaired  by:  Mark  E.  Johnson,  Los  Alamos 
National  Lab 

“Computational  experience  with  simulated  annealing,”  Daniel  G.  Brooks  and 
William  A.  Verdini,  Arizona  State  University 

“Simulated  annealing  in  optimal  design  construction,”  Ruth  K.  Meyer,  St.  Cloud  State 
University  and  Christopher  J.  Nachtsheim,  University  of  Minnesota 

“A  simulated  annealing  approach  to  mapping  DNA,”  Larry  Goldstein  and 
Michael  J.  Waterman,  University  of  Southern  California 

8:30  a.m.  -  10:30  a.m. 

Dynamical  High  Interaction  Graphics,  Chaired  by:  Paul  Tukey,  Bellcore 

“Determining  properties  of  minimal  spanning  trees  by  local  sampling,”  Allen  McIntosh, 
Bellcore  and  William  Eddy,  Carnegie-Mellon  University 

“Data  animation,”  Rick  Becker,  AT&T  Bell  Labs  and  Paul  Tukey,  Bellcore 

“Dimensionality  constraints  on  projection  and  section  views  of  higher  dimensional  loci,” 
George  Furnas,  Bellcore 

8:30 a.m. - 10:30 a.m.

Contributed  Papers:  Statistical  Methods  III,  Chaired  by:  Thomas  Mazzuchi, 

George  Washington  University 

“Simultaneous  confidence  intervals  in  the  general  linear  model,”  Jason  C.  Hsu, 

Ohio  State  University 

“Empirical  likelihood  ratio  confidence  regions,”  Art  Owen,  Stanford  University 

“An  approximate  confidence  interval  for  the  optimal  number  of  mammography  x-ray  units 
in  the  Dallas-Fort  Worth  metropolitan  area,”  Roger  W.  Peck,  University  of  Rhode  Island 

“Optimizing  linear  functions  of  random  variables  having  a  joint  multinomial  or  multivariate 
normal  distribution,”  Josephina  P.  de  los  Reyes,  University  of  Akron 

“On  covariances  of  marginally  adjusted  data,”  James  S.  Weber,  Roosevelt  University 

8:30  a.m.  -  10:30  a.m. 

Contributed Papers: Simulation II, Chaired by: Robert Jernigan, American University

“SIMDAT  and  SIMEST:  differences  and  convergences,”  James  R.  Thompson,  Rice 
University 

“Simulation  and  stochastic  modeling  for  the  spatial  allocation  of  multi-categorical 
resources,”  Richard  S.  Segall,  University  of  Lowell 

“Robustness study of some random variate generators,” Lih-Yuan Deng, Memphis State
University 

“Testing  multiprocessing  random  number  generators,”  Mark  J.  Durst,  Lawrence  Livermore 
National  Laboratory 

“An  approach  for  generations  of  two  variable  sets  with  a  specified  correlation  and  first  and 
second  sample  moments,”  Mark  Eakin  and  Henry  D.  Crockett,  University  of  Texas  at 
Arlington 

8:30  a.m.  -  10:30  a.m. 

Contributed  Papers:  Biostatistics  Applications,  Chaired  by:  Nancy  Flournoy,  National 
Science  Foundation 

“An  algorithm  to  identify  changes  in  hormone  patterns,”  Morton  B.  Brown,  Fred  J.  Karsch 
and  Benoit  Malpaux,  University  of  Michigan 

“Applying  microcomputer  techniques  to  multiple  cause  of  death  data:  from  magnetic  tape 
to  artificial  intelligence,”  Giles  Crane,  New  Jersey  State  Department  of  Health 

“Spline  estimation  of  death  density  using  census  and  vital  statistics  data,”  John  J.  Hsieh, 
University  of  Toronto 

“Optimum  experimental  design  for  sequential  clinical  trials,”  Richard  Simon,  National 
Cancer  Institute 

“Bayes  estimation  of  cerebral  metabolic  rate  of  glucose  in  stroke  patients,”  P.  David  Wilson, 
University  of  South  Florida,  S.  C.  Huang  and  R.  A.  Hawkins,  UCLA  School  of  Medicine 

8:30  a.m.  -  10:30  a.m. 

Contributed  Papers:  Discrete  Mathematical  Methods,  Chaired  by:  Donald  Gantz,  George 
Mason  University 

“Minimum cost path planning in the random traversability space,” A. Meystel, Drexel
University

“Algorithms to reconstruct a convex set from sample points,” Marc Moore, Ecole
Polytechnique Montreal and McGill University, Y. Lemay, Bell Canada, and
S. Archambault, Ecole Polytechnique Montreal

“On the geometric probability of discrete lines and circular arcs approximating arbitrary
object boundaries,” Chang Y. Choo, Worcester Polytechnic Institute

“Application of orthogonalization procedures to fitting tree-structured models,”
Cynthia O. Siu, The Johns Hopkins University

“Evaluation  of  functions  over  lattices,”  Michael  Conlon,  University  of  Florida 

10:45 a.m. - 12:00 noon

Special Invited Lecture III, Chaired by: Sally Howe, National Bureau of Standards

“Visualizing high dimensional spaces,” Thomas Banchoff, Brown University

10:45 a.m. - 12:45 p.m.

Entropy Methods, Chaired by: Raoul LePage, Michigan State University

“Introduction to relative entropy methods,” John Shore, Entropic Processing Corporation

“Structural covariance matrices and 2-dimensional spectra,” John Burg, Entropic Processing
Corporation

“Matrix completion and determinants,” Charlie Johnson, College of William and Mary

10:45 a.m. - 12:45 p.m.

Contributed Papers: Information Systems, Databases and Statistics, Chaired by:
Robert Teitel, Teitel Data Services

“Information systems and statistics,” Nancy Flournoy, National Science Foundation

“Is there a need for a statistical knowledge base?” Z. Chen, Louisiana State University

“An alternate methodology for subject database planning,” Craig W. Slinkman, Henry D.
Crockett, and Mark Eakin, University of Texas at Arlington

“A sensitivity analysis of the Herfindahl-Hirschman Index,” James R. Knaub, Jr.,
U. S. Department of Energy

“Statistical methods for document retrieval and browsing,” Jan Pedersen, Xerox PARC and
John Tukey and P. K. Halvorsen

10:45  a.m.  -  12:45  p.m. 

Contributed  Papers:  Parallel  Computing,  Chaired  by:  Joseph  Brandenburg,  INTEL 
Scientific  Computers 

“Programming  the  BBN  butterfly  parallel  processor,”  Pierre  duPont,  BBN  Advanced 
Computers 

“A  tool  to  generate  parallel  FORTRAN  code  for  the  Intel  iPSC/2 
hypercube,”  Carlos  Gonzalez,  J.  Chen  and  J.  Sarma,  George  Mason  University 

“All-subsets  regression  on  a  hypercube  multiprocessor,”  Peter  Wollan,  Michigan 
Technological  University 

“Multiply  twisted  N-cubes  for  multiprocessor  parallel  computers,”  T.H.  Shiau,  University  of 
Missouri,  Columbia 

“Markov chains arising in collective computation networks with additive noise,”
R.H. Baran, Naval Surface Warfare Center

10:45 a.m. - 12:45 p.m.

Contributed Papers: Density and Function Estimation, Chaired by: Celesta Ball, George
Mason University

“The L1 asymptotically optimal kernel estimate,” Luc Devroye, McGill University

“Derivative  estimation  by  polynomial-trigonometric  regression,”  Paul  Speckman,  University 
of  Missouri,  Columbia  and  R.L.  Eubank,  Southern  Methodist  University 

“A  pooled  error  density  estimate  for  the  bootstrap,”  Walter  Liggett,  National  Bureau  of 
Standards 

“Efficient  algorithms  for  smoothing  spline  estimation  of  functions  with  or  without 
discontinuities,”  Jyh-Jen  Horng  Shiau,  University  of  Missouri,  Columbia 

“On the convergence of variable bandwidth kernel estimators of a density function,”
Ting Yang, University of Cincinnati

10:45 a.m. - 12:45 p.m.

Contributed Papers: Statistical Methods IV, Chaired by: LeRoy A. Franklin,
Indiana State University

“Stochastic test statistics,” P. Warwick Millar, University of California, Berkeley

“It’s time to stop!,” Hubert Lilliefors, George Washington University

“The effects of heavy tailed distributions on the two sided k-sample Smirnov test,”
Henry D. Crockett and M. M. Whiteside, University of Texas at Arlington

“Performance  of  several  one  sample  procedures,”  David  Turner,  Utah  State  University 

“Exact  power  calculation  for  the  chi-square  test  of  two  proportions,”  Carl  E.  Pierchala, 
Food  and  Drug  Administration 

1. KEYNOTE ADDRESS

Computer  Intensive  Statistical  Inference 
Bradley  Efron,  Stanford  University 


Computer  Intensive  Statistical  Inference 

Bradley  Efron 
Department  of  Statistics 
Stanford  University 


Abstract:  We  discuss  three  recent  data 
analyses  which  illustrate  making  statistical 
inferences (finding significance levels, confidence
intervals  and  standard  errors)  with  the  critical 
assistance  of  the  computer.  The  first  example 
concerns  a  permutation  test  for  a  linear  model 
situation  with  several  covariates.  We  provide  a 
computer-based  compromise  between  complete 
randomization  and  optimum  design,  partially 
answering  the  question  “how  much  randomization 
is  enough?”  A  problem  in  particle  physics  provides 
the  second  example.  We  use  the  bootstrap  to  find 
a  good  estimator  for  an  interesting  decay 
probability  and  then  to  obtain  a  believable 
confidence  interval.  The  third  problem  involves  a 
long-running  cancer  trial  in  which  the  z-value  in 
favor  of  the  more  rigorous  treatment  wandered 
extensively  during  the  course  of  the  experiment.  A 
dubious  theory,  which  suggests  that  the  wandering 
is  just  due  to  random  noise,  is  rendered  more 
believable by a bootstrap analysis. All three
examples illustrate the tendency for computer-based
inference to raise new points in statistical
theory. [Editors' note: Professor Efron provided
this abstract together with the following examples,
which were his handout to summarize his Keynote
Address.]

Mouse  Data:  Ordinary  Permutation  Test 

♦ Two groups of mice, “A” = Treatment (7 mice)
and “B” = Control (9 mice).

♦ For each mouse, measured survival time in days
after surgery.

A: 94 197 16 38 99 141 23            Mean = 86.9
B: 52 104 146 10 50 31 40 27 46      Mean = 56.2
Difference = 30.6

♦ 1000 random divisions of the 16 numbers into
groups of 7 and 9 gave 1000 corresponding values
of Difference = Mean A - Mean B. (In other
words, we permute the labels “A” and “B.”)

♦ Of these, 126 exceeded Difference = 30.6, for an
attained significance level (asl) = .126.
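
The permutation test above can be sketched in a few lines. This is a minimal illustration using the handout's mouse data; the function name `permutation_asl`, the random seed, and the use of "greater than or equal" when counting exceedances are assumptions, so the resulting asl will be close to, but not identical with, the .126 reported in the handout.

```python
import random

# Survival times in days, as listed in the handout.
treatment = [94, 197, 16, 38, 99, 141, 23]           # group A, 7 mice
control = [52, 104, 146, 10, 50, 31, 40, 27, 46]     # group B, 9 mice


def mean(xs):
    return sum(xs) / len(xs)


def permutation_asl(a, b, n_perm=1000, seed=0):
    """Estimate the attained significance level by permuting group labels.

    Returns the observed difference mean(a) - mean(b) and the fraction of
    random relabelings whose difference is at least as large.
    """
    rng = random.Random(seed)          # seed is an arbitrary choice
    observed = mean(a) - mean(b)
    pooled = list(a) + list(b)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)            # a random division into 7 and 9
        diff = mean(pooled[:len(a)]) - mean(pooled[len(a):])
        if diff >= observed:
            exceed += 1
    return observed, exceed / n_perm


observed, asl = permutation_asl(treatment, control)
print(round(observed, 1), asl)
```

Because only 1000 of the possible divisions are sampled, the estimate carries Monte Carlo error, which is why a different seed gives a slightly different asl.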


[Figure: histogram of the 1000 permutation values of Diff*, mouse data.]


The 14 Scoliosis Patients
Usual Linear Model

$$ y = T\beta + X\alpha + \epsilon $$

with dimensions y: 14x1, T: 14x1, X: 14x6, $\alpha$: 6x1, $\epsilon$: 14x1,

where $\epsilon$ is distributed as $N(0, \sigma^2 I)$. Usual ANOVA
test for $H_0: \beta = 0$ rejects $H_0$ for large values of

$$ S = \tilde{T}'\tilde{y} / (|\tilde{T}|\,|\tilde{y}|) $$

where $\tilde{T}$ and $\tilde{y}$ are projections orthogonal to $L(X)$
(equivalent to t-test for $\beta = 0$).

Data was actually generated from

$$ y = \text{age} + .667 \times T + \epsilon $$

where $\epsilon$ = -0.16 0.31 2.22 -1.49 -0.66 3.71 2.49
-0.87 -1.37 2.57 -3.47 0.09 -5.23 1.95.

[Table: data matrix for the 14 scoliosis patients.]


♦ Gave S = 0.557. Reject $H_0$?

♦ Usual t-test gave asl = $P\{t_7 > 1.77\}$ = .060.

♦ Compare S with values of S* obtained by
permuting T to T* (i.e., permuting the -1s and 1s).

♦ Choose 400 T* vectors randomly from the $\binom{14}{7} = 3432$ possibilities.

♦ Would like $\tilde{T}^*$ to be uniformly distributed in $L^\perp(X)$.

♦ For $v_1, v_2, \ldots, v_8$ an orthonormal basis for $L^\perp(X)$,
looked at projections of T* vectors along each $v_i$;
counted # projections in deciles of “perfectly
uniform” distribution.
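
The stated equivalence between S and the t-test can be checked numerically. The sketch below assumes the standard relation between a partial correlation and its t statistic, $t = S\sqrt{\nu}/\sqrt{1 - S^2}$, with $\nu = 7$ residual degrees of freedom (read off the handout's $t_7$); the function name is illustrative only.

```python
import math


def s_to_t(s, df):
    """Convert a (partial) correlation-type statistic S into a t statistic."""
    return s * math.sqrt(df) / math.sqrt(1.0 - s * s)


# S = 0.557 from the handout; df = 7 is taken from the t_7 reference distribution.
t_value = s_to_t(0.557, 7)
print(round(t_value, 2))
```

With S = 0.557 this reproduces the value 1.77 quoted for the usual t-test.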


[Figure: decile counts of projections in $L^\perp(X)$, for the 400 T* vectors and for the “top 40” permutation values.]


♦ Projection along $v_4$ very non-uniform, so choose
another 400 T* vectors.

♦ Not much better, but these are the ones I
decided to use.

Each T* gives $S^* = \tilde{T}^{*\prime}\tilde{y} / (|\tilde{T}^*|\,|\tilde{y}|)$ with $\tilde{y}$
fixed as shown. Of the 400 S* values, 25.5
exceeded S = 0.557 giving asl = 25.5/400 = .064.

♦ Ideal T vector would have $\tau(T) = |\tilde{T}|^2 = 14$,
that is, $T \perp L(X)$.

♦ $\mathrm{Var}\{\hat{\beta} \mid T\} = \sigma^2/\tau(T)$ in usual model.

♦ If T chosen randomly from 400, mean $\{\tau(T)\}$ =
8.43.


[Figure: histogram of the 400 permutation values.]


♦ I chose T randomly from “top 40,” i.e. those 40
T vectors having greatest $\tau(T)$ values.

Mean $\{\tau(T) \mid \text{Top 40}\}$ = 12.53.

♦ Actually got $T_{372}$, with $\tau_{372}$ = 12.28.

♦ 1.5 of the 40 S* values for the “top 40” reference
set exceeded S = 0.557, asl = .038.

permutation asl, all 400 = .064 {= 25.5/400}
permutation asl, top 40 = .038 {= 1.5/40}

NOTE: Binomial SD for asl is .041
anova asl = .060 {= Prob [ $t_7$ > 1.77 ]}


Maybe top 40 T* vectors all point in the same
direction! No, their direction counts are reasonably
uniform in $L^\perp(X)$, except near $v_4$. Here are the cosines
of angle ($\tilde{y}$, $v_i$):

The Tau Data

♦ Occurrence rates of five different tau decay
events estimated, B1, B2, B3, B4, B5. (Also the
“estimated” SD for each.)

♦ Should have D = B1 - (B2 + B3 + B4 + B5) = 0.

♦ In fact, $\hat{D}$ = 18.25 (or 16.90).

♦ Wanted: a central 98% confidence interval for D.

♦ Normal theory: D ∈ (15.41, 21.09).

[Figure: all τ events.]


♦  Tried  different  trimmed  means:  0,  .1,  .2,  .25,  .3, 
.5. 

♦  For  each  trim,  evaluated  bootstrap  SD  estimate. 

♦  “.3”  gives  lowest  SD  estimate  for  D,  but  “.25” 
easier  to  explain. 

♦  Choose  “.25”  for  remainder  of  analysis. 

♦ Bootstrap SD estimates based on 200 bootstrap
replications for each $\hat{B}_1, \hat{B}_2, \ldots, \hat{B}_5$.

[Diagram: bootstrap SD estimate for $\hat{B}_1$.
Observed data $(B1_1, B1_2, \ldots, B1_{13})$ → 25% trimmed mean → $\hat{B}_1$.
Bootstrap data $(B1^*_1, B1^*_2, \ldots, B1^*_{13})$, drawn i.i.d. from $\hat{F}$ → 25% trimmed mean → $\hat{B}_1^*$.
Repeat 200 times.]


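♦ Then SD($\hat{D}$) = $[\mathrm{SD}(\hat{B}_1)^2 + \cdots + \mathrm{SD}(\hat{B}_5)^2]^{1/2}$.

♦ Here are histograms of $\hat{B}_1^*, \ldots, \hat{B}_5^*$ and also of
$\hat{D}^* = \hat{B}_1^* - (\hat{B}_2^* + \cdots + \hat{B}_5^*)$. 500 bootstraps each.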

♦  Percentiles  of  D*  were  14.20  (.01)  and  19.34 
(.99). 
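
The bootstrap SD computation described above might be sketched as follows. The measurement values are hypothetical (the handout's raw tau data are not reproduced here), only two components are shown instead of five, and "25% trimmed mean" is assumed to mean dropping 25% of the points from each end.

```python
import math
import random


def trimmed_mean(xs, trim=0.25):
    """Mean after dropping a fraction `trim` of the sorted points from each end."""
    s = sorted(xs)
    k = int(trim * len(s))
    kept = s[k:len(s) - k]
    return sum(kept) / len(kept)


def bootstrap_sd(xs, stat, n_boot=200, seed=0):
    """Bootstrap standard deviation of stat(xs) from n_boot resamples."""
    rng = random.Random(seed)
    reps = []
    for _ in range(n_boot):
        resample = [rng.choice(xs) for _ in xs]   # i.i.d. draws with replacement
        reps.append(stat(resample))
    m = sum(reps) / n_boot
    return math.sqrt(sum((r - m) ** 2 for r in reps) / (n_boot - 1))


# Hypothetical measurements of two decay rates (illustration only).
b1 = [84.0, 86.7, 85.2, 90.1, 83.5, 87.3, 86.0, 82.9, 88.4, 85.8, 86.9, 84.6, 91.2]
b2 = [17.1, 18.3, 16.9, 17.7, 19.4, 17.0, 18.8, 16.2, 17.5, 18.1]

sd_b1 = bootstrap_sd(b1, trimmed_mean, seed=1)
sd_b2 = bootstrap_sd(b2, trimmed_mean, seed=2)

# Independent components combine in quadrature, as in the SD(D) formula.
sd_d = math.sqrt(sd_b1 ** 2 + sd_b2 ** 2)
print(sd_b1, sd_b2, sd_d)
```

Repeating the call with different `trim` values gives the comparison of trimmed means that led to the choice of “.25” in the handout.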


♦ Approximate confidence intervals for D:

                      .01      .99
BC                   14.29    19.53
BCa                  14.25    19.49
Boot-T               14.73    18.99
Boot-T (smooth)      14.22    19.20
Normal Theory        15.41    21.09

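Bootstrap T

♦ Let $T = (\hat{D} - D)/\hat{\sigma}$, where $\hat{\sigma}$ is the jackknife estimate
of SD($\hat{D}$).

♦ Use bootstrap to estimate the percentiles of T.

♦ 2000 bootstraps of $T^* = (\hat{D}^* - \hat{D})/\hat{\sigma}^*$ gave estimated
percentiles $\hat{T}^{(.01)} = -2.63$ and $\hat{T}^{(.99)} = 2.73$, with

$[\hat{D} - \hat{\sigma} \times 2.73,\ \hat{D} + \hat{\sigma} \times 2.63] = [14.73, 18.99]$

“Boot-T”

[Figure: histogram of Boot-T for tau data.]

♦ Smoothed bootstrap: Draw from $\hat{F}_1 \oplus N(0, \sigma_1^2)$, etc. Gave -2.97 and
3.31, so

$[\hat{D} - \hat{\sigma} \times 3.31,\ \hat{D} + \hat{\sigma} \times 2.97] = [14.22, 19.20]$

“Boot-T smooth”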
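
The arithmetic behind the Boot-T interval can be sketched as below. The point estimate 16.90 is the second value quoted for D in the handout; the value 0.795 for the jackknife SD is not given in the handout and is back-solved here from the reported endpoints, so treat it as an assumption.

```python
def boot_t_interval(d_hat, sigma_hat, t_lo, t_hi):
    """Bootstrap-t interval [d_hat - sigma_hat * t_hi, d_hat - sigma_hat * t_lo].

    t_lo and t_hi are the lower and upper bootstrap percentiles of T*.
    """
    return d_hat - sigma_hat * t_hi, d_hat - sigma_hat * t_lo


# sigma_hat = 0.795 is an inferred value, not one stated in the source.
lower, upper = boot_t_interval(16.90, 0.795, -2.63, 2.73)
print(round(lower, 2), round(upper, 2))
```

Note that the upper percentile of T* sets the lower endpoint and vice versa, which is why the interval is not symmetric about the point estimate.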

Reference: Efron, B. and Tibshirani, R. (1986).
“Bootstrap methods for standard errors, confidence
intervals and other measures of statistical
accuracy,” Statistical Science, 1, 54-77.

♦ Randomized clinical trial for head and neck
cancer.

♦ Data as of June 30, 1985.

♦ 51 patients in “A,” radiation.

♦ 46 patients in “B,” radiation plus chemotherapy.

♦ y = time to relapse (days).

♦ d = 1 if relapse, 0 if censored.

♦ km = Kaplan-Meier survival curve.

♦ Log-rank (Mantel-Haenszel) test for equality of
survival was z = 2.29 for an attained level of
significance $1 - \Phi(2.29)$ = .011.

♦ Question: Was treatment B relatively more
effective early in the experiment?

♦ “r” = proportion of total experience $\sum y$ for all
patients at a given date (compared to $\sum y$ on June 30,
1985).


Some dubious theory: Let “$z_r$” be the z-value
when the proportion r of the total data is
available (so $z_1$ = final z-value). Then

(a) $E\{z_r\}/E\{z_1\} = \sqrt{r}$

(b) $z_r \sim N(\sqrt{r}\,E\{z_1\},\ 1)$

(c) $z_r \mid z_1 \sim N(\sqrt{r}\,E\{z_1\},\ 1 - r)$

(d) $z_{r_1}$ and $z_{r_2}$ are approximately bivariate normal


[Figure: z-value of log-rank test at various calendar times.]

Date:  1/1/79   '80   '81   '82   '83   '84   '85
r:               10%   25%   40%   60%   78%   96%

♦ z = 2.31 on 6/30/81. Experiment nearly halted.

Bootstrap investigation of z-value on June 30, 1982: r =

♦ Consider as fixed the 72 entry dates (38 for A,
34 for B) occurring before June 30, 1982.

♦ For entry date $e_i$ compute $c_i$ = # of days from
$e_i$ until June 30, 1982.

♦ Let $Y^*_1, Y^*_2, \ldots, Y^*_{38}$ be i.i.d. draws from $km_A$,
the final Kaplan-Meier curve as of June 30, 1985.

♦ For each $Y^*_i$, let $y^*_i = \min(Y^*_i, c_i)$ and $d^*_i$ = 1
or 0 as $y^*_i = Y^*_i$ or $c_i$.

♦ Likewise draw $Y^*_{39}, Y^*_{40}, \ldots, Y^*_{72}$ from $km_B$, the
final Kaplan-Meier curve as of June 30, 1985.

♦ Then compute $z^*_r$, the log-rank z-value for the
bootstrap data.

♦ 200 $z^*_r$'s ~ $N(1.56, 1.04^2)$, $\sqrt{r}\,E\{z_1\}$ = 1.61.

♦ Compare with N(1.61, 1) from (b)!

♦ Did corresponding $z^*_r$'s for r = .246 (January 1,
1981).

♦ Corr($z^*_{r_1}$, $z^*_{r_2}$) = .727.

♦ Compare with (d), corr = $\sqrt{r_1/r_2}$ = .706.
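
The two theoretical comparison values can be reproduced with a short calculation. The later proportion r = 0.494 is not legible in the source; it is inferred here because it is the value that makes both quoted numbers (1.61 and .706) come out, so treat it as an assumption. The final z-value 2.29 stands in for $E\{z_1\}$.

```python
import math

z_final = 2.29      # final log-rank z-value, used as a stand-in for E{z_1}
r_early = 0.246     # proportion of total experience on January 1, 1981
r_later = 0.494     # inferred, not stated legibly in the source

# Theory (b): the mean of z_r is sqrt(r) * E{z_1}.
mean_later = math.sqrt(r_later) * z_final

# Theory (d): the correlation of z-values at two proportions is sqrt(r1 / r2).
corr_theory = math.sqrt(r_early / r_later)

print(round(mean_later, 2), round(corr_theory, 3))
```

These are the values against which the bootstrap mean 1.56 and bootstrap correlation .727 are compared.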


[Figure: bootstrap z-values at 1/1/81 versus z-values at 6/30/82.]

♦ Jagged line: $z_r$ versus r.

Abstract

Consider  an  arbitrary  domain  of  interest  in  n-dimensional  Euclidean  space  and  an  unknown 
function  of  n  arguments  defined  on  that  domain.  Suppose  we  are  given  the  value  of  the  function 
(perhaps  perturbed  with  additive  noise)  at  some  set  of  points.  The  problem  is  to  find  a  function 
that  provides  a  reasonable  approximation  to  the  unknown  one  over  the  domain  of  interest.  This 
paper  presents  a  brief  review  of  current  methodology  aimed  at  dealing  with  this  problem,  and 
presents  a  new  technique  -  multivariate  adaptive  regression  splines  -  that  has  the  potential  to 
overcome  some  of  the  limitations  of  previous  approaches. 

1.0.  Introduction 

Suppose a system under study can be described (over some domain $D \subset R^n$) by

$$ y = f(x_1, \ldots, x_n) + \epsilon \qquad (1) $$

where $y$ is a response or dependent variable of interest, $x_1, \ldots, x_n$ are a set of explanatory or
independent variables, and $f$ is a (deterministic) single valued function of its n-dimensional argument.
The quantity $\epsilon$ is an additive random or stochastic component that (if nonzero) reflects the fact
that $y$ depends on quantities other than $x_1, \ldots, x_n$ that are also varying. We are given a set of
values $\{y_i, x_{1i}, \ldots, x_{ni}\}_{i=1}^{N}$, $(x_{1i}, \ldots, x_{ni}) \in D$ (training sample), and the purpose of the exercise
is to obtain a function $\hat{f}(x_1, \ldots, x_n)$ that provides a reasonable approximation to $f(x_1, \ldots, x_n)$.
Here reasonable usually means accurate since one often wants to use $\hat{f}$ to approximate $f$ at other
points not part of the training sample. If in addition one wants to use $\hat{f}$ to try to understand the
properties of $f$ (and thereby the system that provided the data) then the interpretability of the
representation of $\hat{f}$ is important. It is also sometimes important that $\hat{f}$ be rapidly computable. In
addition, for some applications it is important that $\hat{f}$ be a smooth function of its argument; that
is, at least its low order derivatives exist everywhere in D.

*  Research  supported  in  part  by  National  Security  Agency  Grant  MDA904-88-H-2029 

In low dimensional settings (n ≤ 2) successful developments have occurred in two general
directions: piecewise polynomials and local averaging. The basic idea of piecewise polynomials is
to approximate $f$ by several generally low order polynomials each defined over a different subregion
of the domain D. The approximation is required to be continuous, and sometimes to have continuous
low order derivatives. The tradeoff between smoothness and flexibility of the approximation $\hat{f}$ is
controlled by the number of subregions (knots) and the order of the lowest derivative allowed to be
discontinuous at region boundaries. The most popular piecewise polynomial fitting procedures are
based on splines. [See de Boor (1978) for a general review of splines and Schumaker (1976), (1984)
for reviews of some two-dimensional extensions.]

Local averaging approximations take the form

$$ \hat{f}(x) = \sum_{i=1}^{N} K(x, x_i)\, y_i \qquad (2) $$

where $K(x, x')$ (called the kernel function) usually has its maximum value at $x' = x$ with its
absolute value decreasing as $|x - x'|$ increases. Thus, $\hat{f}(x)$ is taken to be a weighted average of the
$y_i$ where the weights are larger for those observations that are close or local to $x$. For n > 1 the
kernel is usually taken to be a function of the Euclidean distance between the points

$$ K(x, x') = K(|x - x'|). \qquad (3) $$


Local averaging procedures have received considerable attention in the statistical literature
beginning with their introduction by Parzen (1962). Stone (1977) has shown that this approach has
desirable asymptotic properties. They have also seen interest from the mathematical approximation
literature [Shepard (1964), Bozzini and Lenarduzzi (1985)]. Roughness penalty methods [smoothing
(n = 1) and thin plate (n = 2) splines] are closely related to kernel methods based on Euclidean
distance [see Silverman (1985) and Schumaker (1976)].
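
A local averaging estimate of the form (2) with a distance-based kernel (3) can be sketched as follows. The Gaussian kernel shape, the bandwidth value, and the normalization of the weights so that they sum to one are all assumptions made for illustration; the text does not commit to any of them.

```python
import math


def gaussian_kernel(dist, bandwidth):
    """Kernel weight that decreases with Euclidean distance (assumed shape)."""
    return math.exp(-0.5 * (dist / bandwidth) ** 2)


def local_average(x, xs, ys, bandwidth=0.5):
    """Kernel-weighted average of the responses ys at the query point x.

    The weights are normalized to sum to one, so the result is a
    weighted average of the y_i with larger weight on nearby points.
    """
    weights = [gaussian_kernel(math.dist(x, xi), bandwidth) for xi in xs]
    total = sum(weights)
    return sum(w * y for w, y in zip(weights, ys)) / total


# Training sample on a 5 x 5 grid over the unit square, with y = x1 + x2.
xs = [(i / 4, j / 4) for i in range(5) for j in range(5)]
ys = [x1 + x2 for x1, x2 in xs]

estimate = local_average((0.5, 0.5), xs, ys)
print(estimate)
```

At the center of this symmetric grid the estimate recovers the true value 1.0; away from the center, or with sparse data, the averaging introduces bias, which is the difficulty the following paragraphs discuss.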

The direct extension of piecewise polynomials (splines) or local averaging methods to higher
dimensions (n > 2) is straightforward in principle but difficult in practice. These difficulties are
related to the so-called “curse-of-dimensionality”, a phrase coined by Bellman (1961) to express
the fact that exponentially increasing numbers of points are needed to densely populate Euclidean
spaces of increasing dimension. In the case of spline approximations, extension to higher dimensions
is accomplished through tensor products of univariate spline functions. These functions are
associated with a grid of points defined by the outer product of knot positions on each independent
variable. For a given number of knots K on each variable, the size of the grid, and thus the number
of approximating basis functions, grows as $K^n$. For example, in six dimensions a (tensor product)
cubic spline with only one interior knot in each variable has $5^6$ = 15,625 coefficients to be estimated.
That number in ten dimensions is approximately $10^7$. Even though only one interior knot per variable
might be considered a very coarse grid, it still requires a very large number of data points to
estimate the corresponding spline approximation. Finer grids require many more points.
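
The coefficient counts quoted above follow from a simple product. The sketch below assumes the usual count for a univariate spline basis, order plus number of interior knots (4 + 1 = 5 for a cubic spline with one interior knot), raised to the number of variables; the function name is illustrative.

```python
def tensor_spline_coefficients(n_vars, interior_knots=1, order=4):
    """Number of coefficients in a tensor product spline.

    Each variable contributes (order + interior_knots) univariate basis
    functions; order=4 corresponds to a cubic spline.
    """
    per_variable = order + interior_knots
    return per_variable ** n_vars


print(tensor_spline_coefficients(6))    # six dimensions
print(tensor_spline_coefficients(10))   # ten dimensions
```

Adding a second interior knot per variable raises the base from 5 to 6, which already more than doubles the count in six dimensions.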

Local averaging methods suffer a similar fate as the dimension of the function argument space
increases. For example, let D be the unit hypercube in $R^n$ and consider a uniform kernel with
hypercubical support and bandwidth (edge length) covering 10 percent of the range of each
coordinate. Then, if the data are roughly uniformly distributed in $R^n$, the kernel will (on average)
contain only $(0.1)^n$ of the sample, thereby nearly always being empty for moderate to large n. If,
on the other hand, one adjusts the size of the neighborhood (bandwidth) to contain 10 percent of
the sample, it will cover (on average) $(0.1)^{1/n} \times 100$ percent of the range of each variable, resulting
in a very crude approximation.
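
Both calculations in the preceding paragraph can be reproduced directly. This is a small sketch for uniformly distributed data in the unit hypercube; the function names are illustrative.

```python
def fraction_in_kernel(edge, n):
    """Expected fraction of uniform data inside a hypercube of side `edge`."""
    return edge ** n


def edge_for_fraction(fraction, n):
    """Side length needed for the hypercube to hold `fraction` of the data."""
    return fraction ** (1.0 / n)


for n in (1, 2, 5, 10):
    print(n, fraction_in_kernel(0.1, n), round(edge_for_fraction(0.1, n), 3))
```

In ten dimensions a neighborhood holding 10 percent of the sample must span roughly 79 percent of the range of every variable, which is hardly local.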

This  problem  of  the  inherent  sparsity  of  practical  sampling  in  high  dimensions  basically  limits 
the  straightforward  application  of  both  piecewise  polynomials  and  local  averaging  methods  in  these 
settings.  It  does  not,  however,  limit  theoretical  investigation.  It  is  straightforward  to  imagine 
arbitrarily dense sampling of high dimensional spaces. Asymptotic theoretical calculations can
then  be  done.  [See  Stone  (1977)  for  pioneering  work  in  this  area.]  The  (practical)  difficulty  lies 
only  in  obtaining  the  corresponding  large  samples  required  for  accurate  approximations.  It  should 
be  noted  in  addition,  that  local  averaging  approximations  (and  to  a  lesser  extent  tensor  product 
splines)  are  slow  to  compute  and  difficult  to  interpret. 

The curse-of-dimensionality is fundamental and cannot be directly overcome. If the true
underlying function $f(x_1, \ldots, x_n)$ (1) exhibits strong variation of no special structure on all of the
variables in every part of the domain D, then accurate approximation with feasible sample sizes
is not possible. Fortunately, very few functions of interest exhibit behavior quite this dramatic.
Generally there is some (sometimes known, more often unknown) special structure associated with
the function that can be exploited by a sufficiently clever algorithm to reduce the complexity and
thereby achieve more accurate approximation.

Function approximation in high dimensional settings has been pursued mainly in statistics.
The principal approach taken there has been to fit an especially simple parametric form to the
training sample. The most common parameterization is the linear function

$$ \hat{f}(x_1, \ldots, x_n) = \alpha_0 + \sum_{i=1}^{n} \alpha_i x_i. \qquad (4) $$

This is not likely to produce a very accurate approximation to very many functions in $R^n$, but
it has the virtue of requiring relatively few data points, it is easy to interpret, and it is rapidly
computable. Also, if the stochastic component $\epsilon$ (1) is large compared to $f$, then the variability of
the estimate dominates, and the systematic error associated with this simple approximation is not
the most serious problem.

Recently, the linear model has been generalized nonparametrically to the so-called additive
model

$$ \hat{f}(x_1, \ldots, x_n) = \sum_{i=1}^{n} f_i(x_i) \qquad (5) $$

[Friedman and Stuetzle (1981), Breiman and Friedman (1985), Hastie and Tibshirani (1986),
Friedman and Silverman (1987)]. Here the $\{f_i(x_i)\}_1^n$ are each (different) smooth but otherwise arbitrary
functions of a single variable. Although additive models are still not able to accurately approximate
very general functions in $R^n$, they do constitute a much richer class than the simple linear
approximation (4). They share the high interpretability of the linear model (one can view the
univariate functions $f_i$) and they are not overly difficult to compute.

Linear and additive approximations lack generality in that they have limited ability to adapt
to a wide variety of multivariate functions $f$. Also, as the sample size increases there is a limit
to the accuracy of the approximation (unless the true underlying function happens to be exactly
linear or additive over D).

Strategies that attempt to approximate general functions in high dimensionality are based on
adaptive computation. An adaptive computation is one that dynamically adjusts its strategy to
take into account the behavior of the particular problem to be solved, e.g. the behavior of the
function to be approximated. Adaptive algorithms have long been in use in numerical quadrature
[see Lyness (1970); Friedman and Wright (1987)]. In statistics, adaptive algorithms for function
approximation have been developed based on two paradigms, recursive partitioning [Morgan and
Sonquist (1963), Breiman, Friedman, Olshen, and Stone (1984)], and projection pursuit [Friedman
and Stuetzle (1981), Friedman, Grosse, and Stuetzle (1983), Friedman (1985)].

Projection pursuit uses an approximation of the form

$$ \hat{f}(x_1, \ldots, x_n) = \sum_{m=1}^{M} f_m\Big( \sum_{i=1}^{n} \alpha_{im} x_i \Big) \qquad (6) $$

that is, additive functions of linear combinations of the variables. The univariate functions, $f_m$,
are required to be smooth but are otherwise arbitrary. These functions, and the corresponding
coefficients of the linear combinations appearing in their arguments, are jointly optimized to produce
a good fit to the data based on some distance (between functions) criterion - usually squared-error
loss. It can be shown [see Diaconis and Shahshahani (1984)] that any smooth function of n variables
can be represented by (6) for large enough M. The effectiveness of the approach lies in the fact that
even for small to moderate M, many classes of functions can be closely fit by approximations of this
form [see Donoho and Johnstone (1985)]. Another advantage of projection pursuit approximations
is affine equivariance. That is, the solution is invariant under any nonsingular affine transformation
(rotation and scaling) of the original explanatory variables. It is the only general method suggested
for practical use that seems to possess this property. Projection pursuit solutions have some
interpretive value (for small M) in that one can inspect the functions $f_m$ and the corresponding linear
combination vectors. Evaluation of the resulting approximation is computationally fast.
Disadvantages of the projection pursuit approach are that there exist some simple functions that require
large M for good approximation [see Huber (1985)], it is difficult to separate the additive from the
interaction effects associated with the variable dependencies, interpretation is difficult for large M,
and the approximation is computationally time consuming to construct.

Recursive partitioning approximations take the form

$$ \hat{f}(x_1, \ldots, x_n) = \sum_{m=1}^{M} f_m(x_1, \ldots, x_n)\, I[(x_1, \ldots, x_n) \in R_m]. \qquad (7) $$

Here $I(\cdot)$ is a 0/1 valued function that indicates the truth of its argument and $\{R_m\}_1^M$ are disjoint
subregions representing a partition of D. The functions $f_m$ are generally taken to be of quite simple
parametric form. The most common is a constant function

$$ f_m(x_1, \ldots, x_n) = a_m \qquad (8) $$

[Morgan and Sonquist (1963) and Breiman, et al. (1984)]. Linear functions (4) have also been
proposed [Breiman and Meisel (1976) and Friedman (1979)], but they have not seen much use. The
partitioning is developed in a recursive manner. At each step, M, all existing subregions $\{R_m\}_1^M$
are optimally split into two subregions along one of the variables. The particular split that yields
the best improvement in the fit is taken to define two new regions and the parent region (that was
split) is deleted. (The starting region is the entire domain D.) The number of subregions in the
partition is thereby increased by one at each step. A backwards stepwise strategy for determining
the final number of regions is detailed in Breiman, et al. (1984).
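
One step of the partitioning just described, applied to a single region with the piecewise constant model (8), might look like the sketch below. It is a simplified illustration rather than the procedure of Breiman et al.: it searches midpoints between adjacent observed values on each variable and scores a split by the residual sum of squares; the names `sse` and `best_split` are hypothetical.

```python
def sse(ys):
    """Residual sum of squares around the mean (zero for an empty region)."""
    if not ys:
        return 0.0
    m = sum(ys) / len(ys)
    return sum((y - m) ** 2 for y in ys)


def best_split(xs, ys):
    """Find the axis-parallel split of one region that minimizes total SSE.

    Returns (variable index, threshold, SSE after the split), fitting a
    separate constant on each side of the threshold.
    """
    best = None
    n_vars = len(xs[0])
    for j in range(n_vars):
        values = sorted(set(x[j] for x in xs))
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2          # candidate threshold between neighbors
            left = [y for x, y in zip(xs, ys) if x[j] <= t]
            right = [y for x, y in zip(xs, ys) if x[j] > t]
            total = sse(left) + sse(right)
            if best is None or total < best[2]:
                best = (j, t, total)
    return best


# Toy data: the response depends only on the first variable.
xs = [(0, 5), (1, 3), (2, 8), (3, 1)]
ys = [0.0, 0.0, 10.0, 10.0]
split = best_split(xs, ys)
print(split)
```

The full procedure would apply this search to every current subregion, keep the single split with the largest improvement, and repeat.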

The recursive partitioning approach has the potential to provide acceptable approximations
in high dimensionalities provided the underlying function has low “local” dimensionality. That is,
even though the function $f$ (1) may strongly depend on all of the variables, in any local region of
the domain the dependence is strong on only a few of them. These few variables may be different
in different regions. Another assumption inherent in the recursive partitioning strategy is that
interaction effects have marginal consequences. That is, a local intrinsic dependence on several
variables, when best approximated by an additive function, does not lead to a constant model.
This is nearly always the case.

Recursive partitioning approximations using piecewise constants (8) are fairly interpretable
owing to the fact that they are very simple and can be represented by a binary tree [see Breiman
et al. (1984)]. They are also fairly rapid to construct and especially rapid to evaluate.

Although  recursive  partitioning  is  the  most  adaptive  of  the  methods  for  multivariate  function 
approximation  it  suffers  from  some  fairly  severe  restrictions  that  limit  its  effectiveness.  Foremost 
among  these  is  that  the  approximating  function  is  discontinuous  at  the  subregion  boundaries.  This 
is  more  than  a  cosmetic  problem.  It  severely  limits  the  accuracy  of  the  approximation,  especially 
when  the  true  underlying  function  is  continuous.  Even  imposing  continuity  only  of  the  function 
(as  opposed  to  derivatives  of  low  order)  is  usually  enough  to  dramatically  increase  approximation 
accuracy. 

Another problem with recursive partitioning is that certain types of simple functions are difficult
to approximate. These include linear functions with more than a few nonzero coefficients [with
the piecewise constant approximation (8)] and additive functions (5) in more than a few variables
(piecewise constant or piecewise linear approximation). In addition, one cannot discern from the
representation of the model whether the approximating function is close to a simple one, such as
linear or additive, or whether it involves complex interactions among the variables.

2.0.  Multivariate  Adaptive  Regression  Splines. 

This section describes a new method of adaptive computation for approximating functions in
high dimensionalities. Although it is an extension of the additive modeling (5) procedure developed
by Friedman and Silverman (1987), it appears closest in spirit to the adaptive nature of the
recursive partitioning approach. Unlike recursive partitioning, however, it produces strictly continuous
approximations (with continuous derivatives if desired), it easily approximates linear and
additive functions, and it can be represented in a form that permits separate identification of the
additive and (multiple) interaction effects associated with the variables that enter into the model.

The  approximation  takes  the  form  of  an  expansion  in  multivariate  spline  basis  functions, 

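$$\hat{f}(x_1, \ldots, x_n) = \sum_{m=0}^{M} a_m B_m(x_1, \ldots, x_n) \qquad (9a)$$

with

$$B_0(x_1, \ldots, x_n) = 1, \qquad (9b)$$

$$B_m(x_1, \ldots, x_n) = \prod_{k=1}^{K_m} b(x_{v(k,m)} \mid t_{km}), \quad m \ge 1. \qquad (9c)$$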

The $\{a_m\}_0^M$ are the coefficients of the expansion. Each multivariate spline basis function $B_m$,
$m > 0$, is a product of univariate spline basis functions $b$, each of a single variable $x_{v(k,m)}$,
characterized by a knot at $t_{km}$. The subscripts $v(k,m)$ label the explanatory variables, thereby
taking values in the range $1 \le v(k,m) \le n$; $K_m$ takes values in the same range $1 \le K_m \le n$ and
determines the number of factors (univariate spline basis functions) comprising the corresponding
$B_m$. The multivariate spline basis functions $B_m$ are adaptive in that the number of factors $K_m$,
the variable set $V(m) = \{v(k,m)\}_1^{K_m}$, and the knot set $\{t_{km}\}_1^{K_m}$ are all determined by the data.

The approximation is developed in a forward/backwards stepwise recursive manner in analogy
with the recursive partitioning approach. Given the previously defined basis functions
$\{B_m\}_0^{M-1}$, the $M$th term takes the form

$$B_M(x_1, \ldots, x_n) = B_\ell(x_1, \ldots, x_n)\, b(x_v \mid t) \qquad (10)$$

with $0 \le \ell \le M - 1$. That is, the next term $B_M$ is taken to be the product of a univariate
spline basis function with one of the previously defined multivariate spline basis functions $B_\ell$
($0 \le \ell \le M - 1$). The values for $v$, $t$, and $\ell$ are chosen so as to jointly maximize the
goodness-of-fit of the resulting approximation (see Section 2.2). The defining variable $x_v$ for the new
basis function $b(x_v \mid t)$ is restricted to be one that does not appear in the selected $B_\ell$, so that
the same variable does not appear more than once in any $B_m$ ($0 \le m \le M$). The resulting optimal
values $v^*$, $t^*$, and $\ell^*$ are then used to form the new multivariate spline basis function

$$B_M = \prod_{k=1}^{K_M} b(x_{v(k,M)} \mid t_{kM})$$

with $K_M = K_{\ell^*} + 1$, $v(K_M, M) = v^*$, $t_{K_M M} = t^*$, and the rest of the factors taken from $B_{\ell^*}$.

One of the requirements for this strategy to be computationally feasible is that each univariate
basis function be defined by the location of a single knot $t_{km}$. We therefore use the truncated power
basis representation for the (univariate) splines

$$b^{(q)}(x \mid t) = (x - t)_+^q \qquad (11)$$

where $q$ is the order of the spline, which controls the degree-of-continuity of the approximation.
The subscript $+$ denotes the non-negative part. (This basis is known to produce numerical problems,
especially for $q > 1$, so a great deal of care must be taken in the implementation.)
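
The definitions (9b), (9c) and (11) can be illustrated with a short Python sketch. It is only an illustration under stated assumptions, not the implementation used for the results reported here: the function names `truncated_power` and `basis_product`, the dictionary layout for the covariate vector, and the convention that the $q = 0$ step function is zero exactly at the knot are all choices made for this example.

```python
def truncated_power(x, t, q=1):
    """Evaluate the univariate basis (x - t)_+^q of (11) for scalar x and knot t."""
    if q == 0:
        # Assumed convention: the step function is 1 strictly right of the knot.
        return 1.0 if x > t else 0.0
    return max(0.0, x - t) ** q


def basis_product(x, factors, q=1):
    """Evaluate a multivariate basis function as in (9c).

    x       : dict mapping variable number (1..n) to its value
    factors : list of (variable number, knot) pairs; an empty list gives B_0 = 1
    """
    value = 1.0
    for variable, knot in factors:
        value *= truncated_power(x[variable], knot, q)
    return value


# A two-factor basis function in variables 1 and 2 (knots chosen arbitrarily).
point = {1: 1.0, 2: 2.0}
print(basis_product(point, [(1, 0.5), (2, 1.0)]))
```

An empty factor list reproduces the constant function (9b), so the same routine covers every term of the expansion (9a).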

This forward stepwise construction of the multivariate spline basis (9), (10) is continued until
$M = M_{max}$ terms have been entered into the approximation. This process yields a sequence of
$M_{max}$ models, each with one more term than the previous one in the sequence. Each model in
the sequence has an associated badness-of-fit score (see Section 2.2). That model with the lowest
badness-of-fit score is then subjected to a backwards stepwise deletion strategy [see Friedman and
Silverman (1987), Section 2.1], to obtain the final model. The upper limit $M_{max}$ should be taken to
be large enough so that the minimizing model is not too close to the end of the sequence. Due to
the forward stepwise nature of the procedure it is possible for the badness-of-fit to locally increase
a bit as the sequence proceeds, and then start to decrease again.
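
The knot search performed at each forward step can be sketched in a deliberately reduced setting. The Python example below is a simplification under stated assumptions, not the procedure described in this paper: it handles a single predictor, a single new term whose parent is the constant function, $q = 1$, and it ranks candidate knots by residual sum of squares rather than by the model selection criterion of Section 2.2. The names `hinge`, `fit_line`, and `best_knot` are hypothetical.

```python
def hinge(x, t):
    """Truncated linear basis (x - t)_+, i.e. (11) with q = 1."""
    return max(0.0, x - t)


def fit_line(h, y):
    """Least-squares intercept and slope of y on h (slope 0 if h is constant)."""
    n = len(y)
    mean_h = sum(h) / n
    mean_y = sum(y) / n
    sxx = sum((hi - mean_h) ** 2 for hi in h)
    sxy = sum((hi - mean_h) * (yi - mean_y) for hi, yi in zip(h, y))
    slope = sxy / sxx if sxx > 0.0 else 0.0
    return mean_y - slope * mean_h, slope


def best_knot(x, y, candidates):
    """Return (knot, rss) for the candidate knot giving the smallest RSS."""
    best = None
    for t in candidates:
        h = [hinge(xi, t) for xi in x]
        a0, a1 = fit_line(h, y)
        rss = sum((yi - a0 - a1 * hi) ** 2 for hi, yi in zip(h, y))
        if best is None or rss < best[1]:
            best = (t, rss)
    return best


# Noise-free data generated from a single hinge with its knot at 0.5.
x = [i / 10 for i in range(11)]
y = [hinge(xi, 0.5) for xi in x]
knot, rss = best_knot(x, y, x)
print(knot, rss)
```

In the full procedure the same search also runs over every eligible variable and every previously defined term, which is why the updating formulae of Section 2.5 matter.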

If one makes the restriction $K_m = 1$ (9c) for all $m$ (that is, always setting $\ell = 0$ rather than
including it in the optimization) the approximation becomes a sum of functions, each of a single
variable. This is, of course, an additive model (5) and this strategy reduces to the smoothing and
additive modeling technique introduced by Friedman and Silverman (1987). The key ingredient
that advances this approach to general settings is the ability to fit (possibly complex) interactions
among the variables through the product terms that are permitted to enter the approximation (9),
if required by the fit.

Although originally motivated by the work of Friedman and Silverman (1987), this approximation
strategy (9)-(11) has more in common with the recursive partitioning approach (see Section
1.0) to function approximation (7). There is a correspondence between the terms in (9) and the
regions in (7). Choosing a previous term for multiplication (10) is analogous to choosing a (previous)
region to split in (7). The optimization over $v$ and $t$ in (10) is quite similar to finding the
optimal splitting variable and split point for partitioning a region.

The correspondence between this basic approach and recursive partitioning is most easily seen
by contrasting the piecewise constant approximation (8) of the latter with the use of $q = 0$ splines
(11) in the former

$$b^{(0)}(x \mid t) = I(x - t). \qquad (12)$$

Both methods then produce piecewise-constant approximations in this case, and multiplying (sometimes
with constraints) is strictly equivalent to splitting. The two methods, even though being most
similar in this setting, do not however produce equivalent approximations. This is basically because
unlike  recursive  partitioning,  the  subregions  induced  by  (9),  (10),  (12)  are  not  constrained  to  be 
disjoint.  At  any  stage  during  recursive  partitioning,  only  terminal  regions  are  eligible  for  splitting, 
i.e.  only  those  regions  defined  by  the  intersections  of  previous  splits  (terminal  nodes  on  the  current 
binary  tree).  With  the  MARS  strategy  all  previously  defined  regions  -  not  just  terminal  ones  -  are 
eligible  for  splitting  at  any  stage  of  the  model  building  process.  The  previously  defined  regions  are 
those  represented  by  the  internal  nodes  of  the  tree  and  are  unions  of  subsets  of  current  terminal 
regions. 

The  strategy  associated  with  the  MARS  approach  has  several  important  advantages.  Foremost 
among  them  is  that  it  allows  close  approximations  to  many  of  the  common  functions  that  present 
difficulty  to  recursive  partitioning  (e.g.  nearly  linear  or  additive  functions).  Another  advantage  is  its 
interpretability  through  its  ANOVA  representation  (see  below).  The  most  important  advantage  of 
this  approach,  however,  is  that  by  choosing  q  >  0  (11)  continuous  approximations  can  be  achieved. 
This  has  been  one  of  the  most  serious  limitations  of  recursive  partitioning.  Choosing  a  value  for 
q  >  1  causes  the  approximation  to  be  continuous  and  to  possess  continuous  derivatives  to  order 
q-1. 

As with recursive partitioning, this method attempts to use to advantage the fact that interaction
effects involving several variables will give rise to non-constant dependencies on at least one
of those variables individually. This is because in the forward part of the model building strategy,
additive terms and lower order interactions must enter before the corresponding higher order
interactions. These lower order terms provide information as to where to place knots to capture the
corresponding higher order ones, and they may in fact be removed (through the backwards deletion
process) after the higher order interaction terms are entered.

2.1.  ANOVA  Decomposition. 

The representation of the approximation given by (9), (10), (11) resulting from construction
of the model

$$\hat{f}(x_1, \ldots, x_n) = a_0 + \sum_{m=1}^{M} a_m \prod_{k=1}^{K_m} (x_{v(k,m)} - t_{km})_+ \qquad (13)$$

does not provide much insight into the nature of the approximation. By simply rearranging the
terms, however, it is able to provide considerable insight into the predictive relationship between $y$
and $x_1, \ldots, x_n$:

$$\hat{f}(x_1, \ldots, x_n) = a_0 + \sum_{K_m = 1} f_i(x_i) + \sum_{K_m = 2} f_{ij}(x_i, x_j) + \sum_{K_m = 3} f_{ijk}(x_i, x_j, x_k) + \cdots \qquad (14a)$$

Here the first sum is over all terms involving only a single variable and represents the purely additive
component of the model. Each additive function $f_i(x_i)$ can be computed by collecting together all
single variable terms involving $x_i$,

$$f_i(x_i) = \sum_{\substack{K_m = 1 \\ i \in V(m)}} a_m B_m(x_i). \qquad (14b)$$

Here $V(m)$ represents the variable set $\{v(k,m)\}_1^{K_m}$ associated with the $m$th term. The second sum
in (14a) is over all terms involving exactly two variables and represents the pure first order (two
variable) interaction part of the model with

$$f_{ij}(x_i, x_j) = \sum_{\substack{K_m = 2 \\ (i,j) \in V(m)}} a_m B_m(x_i, x_j). \qquad (14c)$$

Similarly, the third sum represents second order (three variable) interactions with

$$f_{ijk}(x_i, x_j, x_k) = \sum_{\substack{K_m = 3 \\ (i,j,k) \in V(m)}} a_m B_m(x_i, x_j, x_k) \qquad (14d)$$

and so on. The additive terms can be viewed by plotting $f_i(x_i)$ against $x_i$ as one does with additive
modeling. The two variable interaction terms $f_{ij}(x_i, x_j)$ can be plotted using either contour or
perspective mesh plots. Higher order interactions (if present) are of course more difficult to view.
The corresponding (multivariate) knot locations can, however, provide some insight. We refer to
(14) as the ANOVA decomposition or representation of the MARS model because of its similarity
to decompositions provided by the analysis of variance of contingency tables.

The  ANOVA  representation  identifies  the  particular  variables  that  enter  into  the  model, 
whether  they  enter  purely  additively  or  are  involved  in  interactions  with  other  variables,  the  order 
of  the  interactions,  and  the  other  variables  that  participate  in  them. 
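
The bookkeeping behind the ANOVA decomposition amounts to grouping the retained terms by the set of variables they involve. The Python sketch below illustrates this under stated assumptions: the function name `anova_groups` and the dictionary layout are hypothetical, and the variable sets and coefficients are those of the first seven terms of Example 3.1 (Section 3.1), where a zero coefficient marks a deleted term.

```python
from collections import defaultdict


def anova_groups(terms):
    """Group retained terms by the set of variables they involve.

    terms : dict mapping term number m to (coefficient a_m, tuple of variables)
    Returns a dict mapping each variable set to the list of its term numbers.
    """
    groups = defaultdict(list)
    for m, (coefficient, variables) in terms.items():
        if coefficient != 0.0:  # a zero coefficient means the term was deleted
            groups[frozenset(variables)].append(m)
    return dict(groups)


# Terms 1-7 of Example 3.1: coefficient and the variables entering each term.
terms = {
    1: (0.0, (1,)),
    2: (0.8746, (2, 1)),
    3: (0.4525, (3,)),
    4: (0.3171, (4,)),
    5: (0.2232, (5,)),
    6: (0.0, (2,)),
    7: (0.2373, (1, 2)),
}
for variables, members in anova_groups(terms).items():
    print(sorted(variables), members)
```

Each group corresponds to one ANOVA function: sets of size one give the additive functions (14b), sets of size two the two-variable interactions (14c), and so on.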

2.2.  Model  Selection. 

As in Friedman and Silverman (1987) we use the generalized cross-validation criterion (Craven
and Wahba, 1979)

$$\mathrm{GCV}(M) = \frac{1}{N} \sum_{i=1}^{N} [y_i - \hat{f}_M(x_{1i}, \ldots, x_{ni})]^2 \Big/ \left[1 - \frac{C(M)}{N}\right]^2 \qquad (15a)$$

for model selection where $M$ is the number of terms in (9a) and

$$C(M) = (d + 1)M + 1. \qquad (15b)$$

Minimization of this criterion is used to select the knot variable and its location at each forward
step, the terms to delete in the backwards steps, and the size of the final model. The use of (15b)
results in a change of $(d + 1)$ “degrees-of-freedom” for each term in the model, one for fitting
the least-squares coefficient $a_m$, and $d$ for the optimization associated with the knot placement.
Friedman and Silverman (1987) used $d = 2$. This was motivated somewhat on theoretical grounds
but mostly on an empirical basis. This value is too small for generalized MARS modeling since we
are, in addition, optimizing over the term index $0 \le \ell \le M - 1$ at each step as well as the knot
location. This produces increased variance that must be accounted for in the model selection. A
direct approach would be to estimate an optimal $d$ value for the problem at hand through a sample
reuse technique such as the 632 bootstrap (Efron, 1983) or cross-validation (Stone, 1974).
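
The criterion (15a) with the cost function (15b) is simple to compute once the fitted values are available. The following Python sketch is illustrative only; the function names `effective_parameters` and `gcv` are assumptions, and the default $d = 2.5$ is the value used for the examples of Section 3.

```python
def effective_parameters(M, d=2.5):
    """Cost function C(M) = (d + 1) M + 1 of (15b)."""
    return (d + 1.0) * M + 1.0


def gcv(y, fitted, M, d=2.5):
    """Generalized cross-validation score (15a) for a model with M terms."""
    N = len(y)
    # Average squared residual: the numerator of (15a).
    asr = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted)) / N
    return asr / (1.0 - effective_parameters(M, d) / N) ** 2


# With d = 2.5 each term costs 3.5 effective parameters.
for M in (1, 2, 17):
    print(M, effective_parameters(M))
```

The score is only meaningful while $C(M) < N$; as $C(M)$ approaches $N$ the denominator tends to zero and the penalty grows without bound.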

Another approach is to study the variance directly through a modified bootstrapping technique
(Hastie and Tibshirani, 1985). Each bootstrap replication consists of replacing each response value
by a standard normal deviate. By construction the true underlying function $f$ is the constant zero,
and the mean-squared-prediction error is completely dominated by the variance

$$E(f - \hat{f}_M)^2 = E\hat{f}_M^2 = \mathrm{Var}\, \hat{f}_M$$

or equivalently

$$E(y - \hat{f}_M)^2 = E\hat{f}_M^2 + 1. \qquad (16)$$

Since the GCV score (15a) is intended to be an estimate for (16) one can obtain an estimate for
$C(M)$ through

$$E(ASR_M) \Big/ \left[1 - \frac{C(M)}{N}\right]^2 = E\hat{f}_M^2 + 1$$

or

$$C(M) = N \left[ 1 - \left( \frac{E(ASR_M)}{E\hat{f}_M^2 + 1} \right)^{1/2} \right]. \qquad (17)$$

Here the average-squared-residual, $ASR_M$, is the numerator in (15a). The expected values in (17)
are estimated through repeated bootstrap replications.

A wide variety of simulation studies (not detailed here) using this approach indicate the following.

(1)  C{M)  is  a  monotonically  increasing  function  with  decreasing  slope  as  M  increases. 

(2)  Using  the  linear  approximation  (15b),  with  d  =  2.5,  is  fairly  effective,  if  somewhat  crude. 

(3)  The  “best”  value  for  d  depends  (weakly)  on  M,  N,  and  the  distribution  of  the  covariate  vectors. 

(4) Over a wide variety of situations, the best value of $d$ lies in the range $2.0 \le d \le 3.0$.

(5) The actual accuracy of the approximation, in terms of integrated squared error

$$ISE = \int [f(x) - \hat{f}(x)]^2\, dF(x),$$

depends very little on the value chosen for $d$ in the range $2.0 \le d \le 3.0$.

(6) The estimated accuracy

$$E[ISE - \mathrm{GCV}(M^*)]^2,$$

with $M^*$ being the minimizer of (15), does show a moderate dependence on the choice of $d$.

The consequence of (5) and (6) is that, although how well one is doing with this approach is fairly
independent of $d$, how well one thinks one is doing (based on the optimizing GCV score) does
depend somewhat on the value chosen for $d$. Therefore, a sample reuse technique should be used
to estimate the predictive capability of the final model, if it needs to be known fairly precisely.

2.3. Degree-of-Continuity.

Another important choice is the degree of continuity to be imposed on the approximating
function, i.e. the value for $q$ in (11). This choice affects the accuracy of the approximation, and
the speed and numerical stability of the computation. Friedman and Silverman (1987) used $q = 1$
in conjunction with the knot placement and model selection strategy. This produces a continuous
piecewise linear approximation with discontinuous derivatives. Advantages of this approach are
much more rapid and numerically stable computation compared to higher values of $q$. Also, it can
provide more accurate approximations in some situations. The main disadvantage is discontinuous
first derivatives.

Friedman and Silverman (1987) provide for derivative smoothing by replacing the basis functions
$b^{(1)}(x \mid t)$ (11) by closely related ones with continuous first derivatives:

$$C(x \mid t_-, t, t_+) = \begin{cases} 0 & x \le t_- \\ p(x - t_-)^2 + r(x - t_-)^3 & t_- < x < t_+ \\ x - t & x \ge t_+ \end{cases} \qquad (18a)$$

with $t_- < t < t_+$. Setting

$$p = (2t_+ + t_- - 3t)/(t_+ - t_-)^2, \qquad r = (2t - t_+ - t_-)/(t_+ - t_-)^3 \qquad (18b)$$

causes these basis functions to be continuous and have continuous first derivatives. This approximation
has discontinuous second derivatives at the side knot locations, $t_-$ and $t_+$. The central
knot, $t$, is placed at the corresponding knot location of $b^{(1)}(x \mid t)$. The two side knots, $t_-$ and $t_+$, are
placed at the midpoints between adjacent central knots on the same variable, thereby minimizing
the number of second derivative discontinuities. The (central) knots are placed using the $b^{(1)}(x \mid t)$
(11) basis, taking advantage of the corresponding speed and numerical stability. The approximation
with continuous derivatives is accomplished through using the corresponding piecewise cubic
basis (18).
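
The continuity properties of the piecewise cubic basis can be checked numerically. The Python sketch below is an illustration under stated assumptions rather than the implementation used here: the function name `smooth_hinge` and the side knot values are hypothetical, and the coefficients are taken to be $p = (2t_+ + t_- - 3t)/(t_+ - t_-)^2$ and $r = (2t - t_+ - t_-)/(t_+ - t_-)^3$.

```python
def smooth_hinge(x, t_minus, t, t_plus):
    """Piecewise cubic replacement for (x - t)_+ with side knots t_minus < t < t_plus."""
    delta = t_plus - t_minus
    p = (2.0 * t_plus + t_minus - 3.0 * t) / delta ** 2
    r = (2.0 * t - t_plus - t_minus) / delta ** 3
    if x <= t_minus:
        return 0.0
    if x >= t_plus:
        return x - t
    u = x - t_minus
    return p * u ** 2 + r * u ** 3


# Compare the cubic piece with the linear piece on either side of t_plus.
t_minus, t, t_plus = 0.0, 0.4, 1.0
eps = 1e-6
left_value = smooth_hinge(t_plus - eps, t_minus, t, t_plus)
right_value = smooth_hinge(t_plus + eps, t_minus, t, t_plus)
left_slope = (smooth_hinge(t_plus - eps, t_minus, t, t_plus)
              - smooth_hinge(t_plus - 2 * eps, t_minus, t, t_plus)) / eps
print(left_value, right_value, left_slope)
```

At $x = t_+$ the cubic piece evaluates to $p(t_+ - t_-)^2 + r(t_+ - t_-)^3 = t_+ - t$ and its slope to $2p(t_+ - t_-) + 3r(t_+ - t_-)^2 = 1$, matching the linear piece $x - t$; at $x = t_-$ both the value and the slope are zero.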

The analogue to this approach in the more general setting of MARS modeling is to perform
derivative smoothing in the ANOVA representation (14). Each distinct ANOVA function (14b),
(14c), (14d), etc. is smoothed separately. The side knots are placed at the midpoints between
the central knot locations as projected onto each variable defining the particular function. For
the additive ANOVA functions (14b) this of course reduces to the Friedman and Silverman (1987)
strategy. Replacing each $b^{(1)}(x \mid t)$ (11) by its corresponding $C(x \mid t_-, t, t_+)$ (18) in the MARS model
(13), (14) results in a continuous approximation with everywhere continuous derivatives.

2.4.  Knot  Optimization. 

A natural strategy would be to make each distinct observation abscissa value on each predictor
variable a potential location for knot placement. Friedman and Silverman (1987) argue that a
more effective strategy is to restrict the number of candidate knot locations to every $L$th (distinct)
observation abscissa value, with $L$ given by

$$L(p, N) = -\log_2\left[-\frac{1}{pN} \ln(1 - \alpha)\right] \Big/ 2.5 \qquad (19)$$

and $0.01 \le \alpha \le 0.05$. The considerations that lead to this result do not change when one considers
the more general MARS setting.

2.5.  Computational  Considerations. 

In order for any method to be practical it must be computationally feasible. If implemented
in a straightforward manner the approximation strategy we propose would require prohibitive
computation. A full $(M + 1)$-parameter linear least squares fit for the coefficients $\{a_m\}_0^M$ must
be performed to evaluate the model selection criterion (15). This must be done at every potential
knot location on every variable for all $M$ (previous) terms at each step $M$. The only way this can
be made to be computationally feasible is through updating formulae. That is, given the solution
fit at one potential knot location, the solution at the next one can be obtained through rapidly
computable simple updates of the previous solution. Friedman and Silverman (1987, Section 2.3)
derived updating formulae for the quantities that enter into the normal equations of the least squares
fit for the additive modeling case. Analogous updating formulae can be derived for the more
general case of MARS modeling. Use of these updating formulae reduces the computation from
being proportional to $M^4 p N^2 / L$ to $M^3 p N / L$. As a point of reference, the computation for the
three examples (Section 3) each required about two minutes on a SUN Microsystems model 3/260.

3.0.  Examples. 

This section provides four illustrations of MARS modeling. The data are simulated so that
the results can be compared with the known (generated) truth. The first and fourth examples are
purely contrived, whereas the middle two are taken from electrical engineering. In all examples
the smoothing parameter $d$ (15b) was taken to be $d = 2.5$. (The software automatically reduces
it to $d_a = 0.8d = 2.0$ for additive modeling.) The minimum number of observations between knot
locations was determined by (19). In all examples the explanatory variables were standardized to
aid in numerical stability. (The MARS procedure is, except for numerics, invariant to the predictor
variable scales.) The response variable was also standardized so that the GCV score would be an
estimate for the fraction of unaccounted for variance ($e^2 = 1 - R^2$).

3.1. Simple Function of Ten Variables.

For this example, $N = 100$ covariate vectors were uniformly generated in an $n = 10$ dimensional
unit hypercube. Associated with each such covariate vector is a response value generated as

$$\begin{aligned} y_i = {} & 0.02\, e^{[\text{exponent in } x_{1i},\, x_{2i} \text{ illegible in source}]} + 5 \sin(\pi x_{3i}/2) \\ & + 3x_{4i} + 2x_{5i} + 0 \cdot x_{6i} + 0 \cdot x_{7i} + 0 \cdot x_{8i} \\ & + 0 \cdot x_{9i} + 0 \cdot x_{10,i} + \varepsilon_i, \quad 1 \le i \le 100, \end{aligned} \qquad (20)$$

with the $\varepsilon_i$ generated from a standard normal distribution. The ratio of standard deviations of the
signal to the noise is 3.08 so that the true underlying function accounts for 91% of the variance of
$y$.

The underlying function (20) consists of an interaction in the first two variables, an additive
nonlinear dependence in the third, and linear dependencies in the fourth and fifth. The last five,
$x_6$ through $x_{10}$, are pure noise variables independent of the response.

Table 1 displays the results of applying the MARS procedure to these data. Table 1a shows
the history of the forward stepwise knot placement. The second column gives the GCV score (15)
at each iteration $M$ (first column). The third column shows the effective number of parameters in
the fit $C(M)$ (15b). The fourth and fifth columns give the optimizing knot variable $v^*$ and location
$t^*$, while the last column points to the optimizing previous term (multivariate spline basis function)
$\ell^*$ that multiplies the new univariate spline function. This term may in fact point to previous terms
for its definition. The value $\ell^* = 0$ indicates that the previous multiplying term is $B_0$ (9b) so that a
new purely additive term is being included in the model. The particular factors comprising the $M$th
multivariate spline basis function are identified by starting with the $M$th row, then proceeding to
its parent, then to its parent’s parent and so on, until reaching a parent value of $\ell^* = 0$.
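
Following the parent pointers can be sketched in a few lines of Python. The example is only an illustration; the function name `factor_variables` and the dictionary layout are assumptions, while the (variable, parent) pairs are those listed for the first eight iterations in Table 1a.

```python
# (knot variable, parent term) for iterations 1 through 8 of Table 1a.
TABLE_1A = {
    1: (1, 0),
    2: (2, 1),
    3: (3, 0),
    4: (4, 0),
    5: (5, 0),
    6: (2, 0),
    7: (1, 6),
    8: (9, 2),
}


def factor_variables(term, table):
    """List the variables of a term's factors by walking parents until reaching 0."""
    variables = []
    while term != 0:
        variable, parent = table[term]
        variables.append(variable)
        term = parent
    return variables


for m in (3, 7, 8):
    print(m, factor_variables(m, TABLE_1A))
```

The walk terminates because every parent index is smaller than the index of the term that points to it.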

Table 1a shows that the first knot was placed on $x_1$. The second knot was placed on $x_2$,
multiplying the first term. At this point ($M = 2$) the model consists of an additive contribution
on $x_1$ and an interaction between $x_1$ and $x_2$. The next three iterations include purely additive
contributions from $x_3$, $x_4$, and $x_5$. The next iteration ($M = 6$) includes an additive term in $x_2$.
This is multiplied by a factor involving $x_1$ on the subsequent iteration ($M = 7$), resulting in two
bivariate splines characterizing the interaction between $x_1$ and $x_2$. Up to this point the GCV score
has been monotonically decreasing.

The eighth iteration places into the model a term involving an interaction between variables $x_9$,
$x_2$, and $x_1$. Note, however, that the GCV score has increased slightly. As more terms are added,
the GCV score continues to increase until the preset maximum number of terms, $M_{max} = 17$, is
reached.

Table 1b shows the result of the backwards stepwise term deletion strategy. The first column
gives the term number, $m$, the second its least squares coefficient, $a_m$ (9a), followed by the knot
variable, location, and parent as in Table 1a. A zero coefficient value, $a_m = 0$, means that the
term has been deleted. Note that in addition to the deletion of all terms beyond $M = 7$, the
purely additive contributions of variables $x_1$ and $x_2$ (first and sixth terms) have also been deleted.
This leaves only the two terms (second and seventh) involving pure interactions between these two
variables.

Table 1c summarizes the ANOVA decomposition of the final model. There are four ANOVA
functions. The first three are additive functions on variables $x_3$, $x_4$, and $x_5$ respectively. The fourth
ANOVA function is bivariate and represents a (pure) interaction between $x_1$ and $x_2$. Table 1c also
gives the GCV score for the fit with the corresponding piecewise cubic basis (18). It is seen to be
essentially the same as for the piecewise linear basis given in Table 1b.

The second column in Table 1c gives the standard deviation of the corresponding ANOVA
function. This gives one indication of its (relative) importance to the model and is interpreted in a
manner similar to a (standardized) regression coefficient in a linear model. The third column gives

Table 1a

History of the MARS forward stepwise knot placement strategy for Example 3.1.

iter.    gcv       # efprms    variable    knot          parent
  1      0.8460      4.5         1.         0.5257         0.
  2      0.5781      8.0         2.        -0.6736         1.
  3      0.3914     11.5         3.        -1.626          0.
  4      0.2885     15.0         4.        -1.170          0.
  5      0.2347     18.5         5.        -1.601          0.
  6      0.1911     22.0         2.        -1.177          0.
  7      0.1599     25.5         1.        -1.164          6.
  8      0.1603     29.0         9.        -1.128          2.
  9      0.1621     32.5         3.        -0.9315         0.
 10      0.1696     36.0         4.         1.015          1.
 11      0.1802     39.5         3.         1.013          0.
 12      0.1829     43.0         6.        -0.2161        11.
 13      0.1936     46.5         4.        -1.675          5.
 14      0.2062     50.0         4.         0.2366e-01    11.
 15      0.2271     53.5         9.         1.583          3.
 16      0.2519     57.0         9.        -0.2349         5.
 17      0.2837     60.5         2.        -0.4146         5.

Table 1b

The result of the backwards stepwise term deletion strategy for Example 3.1.

gcv = 0.1404    # efprms = 18.5

term     coeff.     variable    knot        parent
  1      0.           1.         0.5257       0.
  2      0.8746       2.        -0.6736       1.
  3      0.4525       3.        -1.626        0.
  4      0.3171       4.        -1.170        0.
  5      0.2232       5.        -1.601        0.
  6      0.           2.        -1.177        0.
  7      0.2373       1.        -1.164        6.
  8      0.           9.        -1.128        2.
  9      0.           3.        -0.9315       0.
 10      0.           4.         1.015        1.
 11      0.           3.         1.013        0.
 12      0.           6.        -0.2161      11.

Table 1c

ANOVA decomposition summary of the MARS model for Example 3.1

fun.    std. dev.    -gcv      # terms    # efprms    variable(s)
 1      0.4518       0.4109       1         3.5        3
 2      0.2983       0.2520       1         3.5        4
 3      0.2229       0.1974       1         3.5        5
 4      0.7772       0.8867       2         7.0        1  2

piecewise cubic fit on 5 terms, gcv = 0.1457

Figure 1b: Enlargement of the fourth frame of Figure 1a; interaction contribution of $(x_1, x_2)$
to the MARS model for Example 3.1.


another indication of the importance of the corresponding ANOVA function, by providing the GCV
score for the model with all of the terms corresponding to that particular ANOVA function deleted.
This can be used to judge whether this ANOVA function is making an important contribution
to the model, or whether it just slightly improves the global GCV score. In this example all four
ANOVA functions appear to be important with the third one, involving $x_5$, being the weakest.

Figure 1a provides a pictorial representation of the ANOVA decomposition by plotting the
respective (piecewise-cubic) ANOVA functions. The first three frames plot the respective additive
functions involving $x_3$, $x_4$, and $x_5$. The fourth frame provides a perspective mesh plot of the
bivariate ANOVA function involving $x_1$ and $x_2$. Figure 1b is an enlargement of the fourth frame
of Figure 1a.

These figures show very nearly linear dependencies on $x_3$, $x_4$, and $x_5$, and a strong nonlinear
interaction between $x_1$ and $x_2$. It is important to note that Figure 1b does not represent a smooth of
the response $y$ on variables $x_1$ and $x_2$, but rather it shows the contribution of $x_1$ and $x_2$ to a smooth
of $y$ on variables $x_1, \ldots, x_{10}$. The accuracy of the resulting approximation is fairly remarkable
considering the high dimensionality, $n = 10$, and the small sample size, $N = 100$. Note also that
the procedure (correctly) did not enter $x_6, \ldots, x_{10}$ into the model.

The only shortcoming of the MARS model based on these data is that it did not capture the
nonlinearity in the additive contribution of $x_3$ (20). Figure 1c shows the pictorial representation
of the ANOVA decomposition corresponding to Figure 1a when the sample size is increased to
$N = 200$. The model looks very similar to that for the smaller ($N = 100$) sample size (Figure 1a)
except that it now gives a better approximation to the contribution of $x_3$.

Tables 1a - 1c and Figures 1a - 1b illustrate the application of the MARS procedure to
a single data set (replication) from the particular setting under study (20). They do not give
information on the average performance of the procedure when applied to this situation. Table 1d
displays the results of a simulation study that addresses this issue. Each row summarizes the results
of 100 replications of the following procedure. A sample of N ten-dimensional covariate vectors
was randomly drawn from a uniform distribution in [0, 1]^{10}. A sample of N random standard
normal deviates was then generated and the corresponding response values (20) were assigned to
the covariate vectors. The MARS procedure was then applied. A new data set of 5000 observations
was then generated and used to estimate the normalized integrated squared error

    ISE = \int [\hat{f}(x) - f(x)]^2 \, d^{10}x \, / \, \mathrm{Var}_x f(x),    (21a)

and the normalized predictive squared error

    PSE = (ISE \cdot \mathrm{Var}_x f(x) + 1) / (\mathrm{Var}_x f(x) + 1)    (21b)

(fraction  of  unaccounted  for  variance)  for  the  piecewise  cubic  MARS  model. 
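
The two error measures (21a) and (21b) can be sketched in a few lines of Python. This is only an
illustration under stated assumptions, not the program used for the study: the names ise_pse,
f_true, and f_hat are hypothetical, the integral is approximated by an average over a fresh
uniform sample in the unit hypercube, and the noise variance is taken to be 1 (standard normal
errors), which is where the "+ 1" terms in (21b) come from.

```python
import numpy as np

def ise_pse(f_true, f_hat, n_dim=10, n_test=5000, noise_var=1.0, seed=0):
    """Monte Carlo estimates of the normalized ISE (21a) and PSE (21b).

    f_true and f_hat map an (n_test, n_dim) array to a length-n_test array.
    """
    rng = np.random.default_rng(seed)
    x = rng.uniform(0.0, 1.0, size=(n_test, n_dim))   # fresh evaluation sample
    truth = f_true(x)
    var_f = truth.var()                               # estimate of Var_x f(x)
    ise = np.mean((f_hat(x) - truth) ** 2) / var_f    # (21a)
    pse = (ise * var_f + noise_var) / (var_f + noise_var)  # (21b)
    return ise, pse
```

A perfect fit gives ISE = 0 and PSE equal to the noise fraction of the response variance, so PSE
is bounded below by that fraction no matter how good the approximation is.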

The second column of Table 1d gives the optimizing GCV score averaged over the 100 replications,
whereas the third and fourth columns give the corresponding average PSE and ISE (21)
respectively. The quantities in parentheses are the associated standard deviations over the 100
replications. (The standard deviations of the averages are one tenth these values.)

Table 1d shows results for three sample sizes (N = 50, 100, 200) and for three sets of constraints
applied to the MARS model. These constraints involve the maximum number of factors mi that
are permitted to enter a single multivariate spline basis function. This controls the maximum
interaction order permitted in the model. Setting mi = 1 restricts the model to be additive in
the predictor variables, whereas mi = 2 limits the model to interactions involving at most two
variables, and so on. The value mi = n results in no restriction. Limiting the interaction level of
the MARS model can improve accuracy (reduce variance) if the true underlying function f is close
to an f that involves at most low order interactions. If not, such a limitation will introduce some
bias in exchange for the corresponding variance reduction. In terms of interpretability there is a
strong advantage to models with mi = 2, owing to their graphical representation by means of the
ANOVA decomposition.

Table 1d

Summary of 100 replications of Example 3.1, piecewise cubic fit.

              mi    GCV            PSE            ISE
N = 50:        1    .46 (.12)      .45 (.097)     .40 (.11)
               2    .28 (.13)      .28 (.18)      .22 (.20)
              10    .27 (.11)      .30 (.19)      .24 (.21)
N = 100:       1    .36 (.072)     .36 (.064)     .30 (.070)
               2    .15 (.043)     .14 (.026)     .059 (.029)
              10    .15 (.047)     .16 (.041)     .077 (.044)
N = 200:       1    .32 (.037)     .31 (.022)     .25 (.023)
               2    .12 (.029)     .12 (.015)     .033 (.015)
              10    .12 (.029)     .12 (.024)     .041 (.025)

Table 2a

ANOVA decomposition summary of the MARS model
on alternating current series circuit impedance, Z.

gcv = 0.2311    # efprms = 46.5

fun.    std. dev.    -gcv      # terms    # efprms    variable(s)
1       0.5096       0.6392    1          3.5         1
2       1.833        0.6854    3          10.5        2
3       1.417        0.6431    3          10.5        4
4       0.4195       0.4401    1          3.5         2  3
5       2.034        0.5704    4          14.0        2  4
6       0.1702       0.2577    1          3.5         3  4

piecewise cubic fit on 13 terms, gcv = 0.2447

In terms of ISE (21a) the accuracy of the MARS model for this problem is seen to increase
rapidly as the sample size increases from 50 to 200. The additive model (mi = 1) is seen to be
distinctly inferior to those involving interactions (mi = 2, 10) especially as the sample size increases.
The optimizing GCV score is seen very slightly to overestimate the true PSE on average.

The  true  underlying  function  (20)  in  this  case  happens  to  involve  at  most  interactions  in  two 
variables.  Thus,  setting  mi  =  2  results  here  in  no  increase  in  bias.  Owing  to  the  decrease  in 
variance,  the  ISE  is  seen  to  be  somewhat  better  than  for  the  unrestricted  MARS  model  (mi  =  10). 
The  size  of  the  effect  is  seen,  however,  to  be  fairly  small  (<  25%  in  squared  error  loss)  so  that  a 
large  penalty  is  not  incurred  by  fitting  the  full  nonparametric  model. 

3.2.  Alternating  Current  Series  Circuit. 

Figure 2a shows a schematic diagram of a simple alternating current series circuit involving a
resistor R, inductor L, and capacitor C. Also in the circuit is a generator that places a voltage

    V_{ab} = V_0 \sin \omega t    (21a)

across the terminals a and b. Here ω is the angular frequency which is related to the cyclic frequency
f by

    \omega = 2 \pi f.    (21b)

The electric current I_ab that flows through the circuit is also sinusoidal with the same frequency,

    I_{ab} = (V_0 / Z) \sin(\omega t - \phi).    (21c)

Its amplitude is governed by the impedance Z of the circuit and there is a phase shift φ, both
depending on the components in the circuit:

    Z = Z(R, \omega, L, C),
    \phi = \phi(R, \omega, L, C).

From elementary physics one knows that

    Z(R, \omega, L, C) = [R^2 + (\omega L - 1/\omega C)^2]^{1/2},    (22a)

    \phi(R, \omega, L, C) = \tan^{-1} [(\omega L - 1/\omega C) / R].    (22b)

The purpose of this exercise is to see to what extent the MARS procedure can approximate these
functions and perhaps yield some insight into the variable relationships, in the range

    x_1 :  0 < R < 100 ohms
    x_2 : 20 < f < 280 hertz
    x_3 :  0 < L < 1 henries                (23)
    x_4 :  1 < C < 11 microfarads.

Figure 2a: Schematic diagram of the alternating current series circuit of Example 3.2.

Two hundred four-dimensional uniform covariate vectors were generated in the ranges (23). For
each one, two responses were generated by adding normal noise to (22a) and (22b). The variance
of the noise was chosen to give a 3 to 1 signal to noise ratio for both Z (22a) and φ (22b), thereby
causing the true underlying function to account for 90% of the variance in both cases.
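
The data generation just described can be sketched as follows. This is a minimal illustration and
not the original simulation code: the function names are hypothetical, the capacitance is
converted from microfarads to farads before use, and the noise standard deviation is set to one
third of the sample standard deviation of the signal (an assumption; a 3 to 1 ratio of standard
deviations is what makes the signal account for 9/10 of the variance).

```python
import numpy as np

def impedance(R, f, L, C):
    """Impedance Z of (22a); f in hertz, L in henries, C in farads."""
    omega = 2.0 * np.pi * f
    return np.sqrt(R ** 2 + (omega * L - 1.0 / (omega * C)) ** 2)

def phase_angle(R, f, L, C):
    """Phase shift phi of (22b); arctan2 avoids dividing by R = 0."""
    omega = 2.0 * np.pi * f
    return np.arctan2(omega * L - 1.0 / (omega * C), R)

def simulate_circuit(n=200, snr=3.0, seed=0):
    """Draw n covariate vectors in the ranges (23) and two noisy responses."""
    rng = np.random.default_rng(seed)
    R = rng.uniform(0.0, 100.0, n)          # ohms
    f = rng.uniform(20.0, 280.0, n)         # hertz
    L = rng.uniform(0.0, 1.0, n)            # henries
    C = rng.uniform(1.0, 11.0, n) * 1e-6    # microfarads converted to farads
    X = np.column_stack([R, f, L, C])
    responses = {}
    for name, fn in (("Z", impedance), ("phi", phase_angle)):
        signal = fn(R, f, L, C)
        noise_sd = signal.std() / snr       # 3 to 1 ratio of standard deviations
        responses[name] = signal + rng.normal(0.0, noise_sd, n)
    return X, responses
```

At resonance (ωL = 1/ωC) the reactive term vanishes, so the impedance reduces to R and the phase
shift to zero, which gives a quick sanity check of the two functions.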

3.2.1.  Impedance,  Z. 

Applying the MARS procedure to the impedance data with mi = 1 (additive model) gave an
optimizing GCV score of 0.558. The GCV scores for mi = 2 and 4 were respectively 0.231 and
0.229. The additive model is seen (not surprisingly) to be inadequate. Perhaps more surprising is
the fact that even though the true underlying function (22a) contains interactions to all orders, an
approximation involving only two-variable interactions is seen to give nearly as good a fit to these
data. Owing to its increased interpretability we show the results of the mi = 2 model.

Table 2a shows the ANOVA decomposition in the same format as Table 1c. There is a purely
additive contribution from x_1(R), additive contributions from x_2(ω) and x_4(C), and interactions
amongst x_2, x_3(L), and x_4. Of the six ANOVA functions, all but the last one (involving an
interaction between the capacitance C and the inductance L) seem important to the model. Figure
2b displays a graphical representation of the ANOVA decomposition. The first frame plots the
(additive) contribution from the resistance R. The next three frames display the contributions of
the remaining variables that participate in interactions. These perspective mesh plots show the
total (additive plus interaction) contributions of each such variable pair. For example, the frame
in the upper right corner plots the sum of the second and fourth ANOVA functions, whereas that
of the lower left plots the sum of the second, third, and fifth.

The  plots  have  been  rotated  so  as  to  provide  the  best  perspective  view.  The  indicated  zero 
marks  the  lowest  value  and  the  axis  label  marks  the  direction  of  higher  values. 

The dependence of the impedance Z on R (first frame) is estimated to be approximately linear.
For low frequencies ω, Z is seen to be high and independent of L (upper right frame). For high ω, Z
has a mild monotonically increasing dependence on L. For low L, Z monotonically decreases with
increasing ω, whereas for high L values, the impedance is seen to achieve a minimum for moderate ω
values. The lower left frame shows that Z is very small and roughly independent of ω and C except
when they jointly have very small values, in which case the impedance increases dramatically. The
lower right frame of Figure 2b shows that the C, L joint contribution is nearly additive, consistent
with the weak contribution of the sixth ANOVA function (Table 2a) to the MARS model.

These interpretations are based on visual examination of the graphic representation of the
ANOVA decomposition of the MARS approximation, based on a sample of size N = 200. Since the
data in this case are generated from known truth one can examine the generating equation (22a)
to verify their general correctness.

Table 2b summarizes the results of a simulation study based on 100 replications of data randomly
drawn according to the above prescription (22a), (23), in the same format as Table 1d. The
MARS procedure applied to the smallest sample size, N = 100, is seen to provide a fairly poor
approximation on average in terms of ISE. The approximation accuracy improves substantially
with the larger samples, except for additive modeling (mi = 1). The approximation accuracy for
the constrained (mi = 2) models is (on average) nearly identical to the unconstrained (mi = 4)
ones. It appears that the bias-variance trade-off is exactly offsetting in this case.

The average GCV score is seen to underestimate the corresponding PSE at the smallest sample
size. This is due to the sharp joint dependence of Z on ω and C [see (22a) and Figure 2, third frame].
For small sample sizes most replications will fail to sample covariate vectors with very small joint
values for ω and C, thereby failing to capture the rapid variation of Z in that region. There is no
way that the GCV score (based on the ASR) can detect rapid function variation where there is no
data. Note that sample reuse techniques such as cross-validation or bootstrapping have the same
problem. As the sample size increases enough data is sampled in this region and the GCV score
gives a more accurate estimate of the true PSE (on average).

3.2.2. Phase Angle, φ.




Table 2b

Summary of 100 replications of the alternating current series
circuit impedance, Z, piecewise cubic fit.

              mi    GCV            PSE            ISE
N = 100:       1    .65 (.12)      .71 (.092)     .68 (.10)
               2    .46 (.15)      .52 (.19)      .46 (.21)
               4    .45 (.15)      .52 (.19)      .47 (.21)
N = 200:       1    .60 (.082)     .62 (.050)     .58 (.056)
               2    .27 (.064)     .27 (.10)      .20 (.11)
               4    .28 (.066)     .28 (.091)     .20 (.11)
N = 400:       1    .57 (.049)     .57 (.026)     .52 (.029)
               2    .20 (.057)     .18 (.050)     .095 (.056)
               4    .20 (.035)     .18 (.035)     .092 (.038)

Table 3a

ANOVA decomposition of the MARS model on
the alternating current series circuit phase angle, φ.

gcv = 0.2190    # efprms = 39.5

fun.    std. dev.    -gcv      # terms    # efprms    variable(s)
1       0.6323       0.3257    1          3.5         2
2       0.7253       0.4180    2          7.0         4
3       0.9931       0.3041    1          3.5         1
4       0.6483       0.4015    2          7.0         2  3
5       0.1521       0.2254    1          3.5         2  4
6       0.7754       0.2662    2          7.0         1  4
7       0.2064       0.2248    1          3.5         1  3
8       0.3464       0.2458    1          3.5         1  2

piecewise cubic fit on 11 terms, gcv = 0.2393
The MARS procedure applied to the phase angle data (22b), (23) with mi = 1, 2, and 4 gave
optimizing GCV scores of 0.295, 0.219, and 0.203, respectively. Here the additive model, while
still being less accurate, is more competitive with those involving interactions. The two variable
interaction model again fits the data almost as well as the unconstrained model.

Table 3a summarizes the ANOVA decomposition for the mi = 2 MARS model. It involves
additive contributions from all but x_3(L) and interactions among all variable pairs except C and L.
Two of the ANOVA functions (fifth and seventh) however are seen to make very weak contributions
to the final model. Figure 2c is a graphical representation of the ANOVA decomposition in the
same format as Figure 2b. The dependence of the phase angle φ on all of the variables is seen to be
more gentle and more nearly additive than the impedance Z (Figure 2b). The principal interaction
effect is to decrease the phase angle for simultaneously high values of the predictor variable pairs.

Table 3b gives the results of 100 replications of phase angle data generated according to (22b),
(23). At the smallest sample size (N = 100) the additive model produces fits that (on average) are
nearly as accurate as those involving interactions. For the larger samples the interaction models
are somewhat more accurate in terms of ISE. The average optimizing GCV score is seen to be
quite close to the true average PSE.

3.3.  Additive  Data. 

In the preceding examples there were strong interaction effects and it was seen that allowing
such effects in the MARS model substantially improved approximation accuracy. This example,
taken from Friedman and Silverman (1987), examines what happens when the true underlying
function is exactly additive and interactions are allowed to enter the MARS model. One would
expect accuracy to deteriorate since allowing for interactions among the variables increases the
variance of \hat{f} while, in this particular case, not decreasing the bias.

Table 4 summarizes (in the same format as Tables 1d, 2b, 3b) the results of 100 replications
of the following simulation experiment. N (= 50, 100, 200) 10-dimensional covariate vectors were
generated in the unit hypercube. A set of standard normal deviates ε_i were then generated and
response values were assigned according to

    y_i = 0.1 e^{4 x_{1i}} + 4 / [1 + e^{-20 (x_{2i} - 1/2)}]
          + 3 x_{3i} + 2 x_{4i} + x_{5i} + 0 \cdot x_{6i} + 0 \cdot x_{7i}
          + 0 \cdot x_{8i} + 0 \cdot x_{9i} + 0 \cdot x_{10,i} + \epsilon_i.


Here the signal to noise ratio is 3.28 so that the true underlying function accounts for 92% of the
variance of the response.
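
A data set of this kind can be generated with the short sketch below. It assumes that the
response is the additive test function of Friedman and Silverman (1987), that is, an exponential
term in x_1, a steep logistic term in x_2, linear terms in x_3, x_4, x_5, and zero coefficients
on x_6, ..., x_10; the function names are hypothetical.

```python
import numpy as np

def additive_signal(X):
    """Noise-free additive response; only the first five columns contribute."""
    x1, x2, x3, x4, x5 = (X[:, j] for j in range(5))
    return (0.1 * np.exp(4.0 * x1)
            + 4.0 / (1.0 + np.exp(-20.0 * (x2 - 0.5)))
            + 3.0 * x3 + 2.0 * x4 + x5)

def simulate_additive(n, seed=0):
    """n covariate vectors in the 10-dimensional unit hypercube plus N(0, 1) noise."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(0.0, 1.0, size=(n, 10))
    y = additive_signal(X) + rng.standard_normal(n)
    return X, y
```

Because the last five predictors have zero coefficients, a fitting procedure that performs
variable selection should ideally leave them out of the model.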

The ratio of the average ISE values for the additive and mi = 2 interaction fits is seen (Table
4) to be about 0.67 at all sample sizes. The corresponding ratio for the mi = 10 unconstrained fit
is  about  0.60.  The  corresponding  square  roots  of  the  ratios  are  0.81  and  0.77.  Thus,  the  (average) 
accuracy  here  is  reduced  by  about  25%  when  the  interactive  models  are  fit  to  purely  additive  data. 
This  degradation  is  surprisingly  small  given  the  small  sample  sizes  and  the  high  dimensionality 
(n  =  10).  Note  that  the  average  GCV  scores  for  the  interactive  models  are  always  slightly  worse 
than  that  for  the  corresponding  additive  fit,  so  that  the  interactive  models  are  not  (on  average) 
claiming  to  do  better  than  the  additive  ones.  This  suggests  a  strategy  of  accepting  the  additive 
model  if  those  involving  interactions  fit  no  better  in  terms  of  the  GCV  score,  especially  owing  to 
the  increased  interpretability  of  the  additive  model. 
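
The ratios quoted above can be rechecked from the rounded entries of Table 4. The sketch below
is only an arithmetic aid; the variable names are hypothetical, and because the table values are
rounded the results agree with the text only approximately.

```python
import math

# Average ISE values from Table 4, keyed by sample size N: (mi = 1, mi = 2, mi = 10).
ISE = {50: (0.13, 0.19, 0.19), 100: (0.053, 0.081, 0.088), 200: (0.024, 0.036, 0.040)}

def ise_ratios(table):
    """Ratio of additive ISE to interaction-model ISE at each sample size."""
    return {n: (add / two, add / full) for n, (add, two, full) in table.items()}

ratios = ise_ratios(ISE)
mean_two = sum(r[0] for r in ratios.values()) / len(ratios)    # additive vs. mi = 2
mean_full = sum(r[1] for r in ratios.values()) / len(ratios)   # additive vs. mi = 10
root_two, root_full = math.sqrt(mean_two), math.sqrt(mean_full)
```

The square roots put the comparison on the scale of root squared error rather than squared
error, which is the scale on which the reduction in accuracy is described in the text.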

4.0.  Remarks. 

This  section  covers  various  aspects  (extensions,  limitations,  etc.)  of  the  MARS  procedure  not 
discussed  in  the  previous  sections. 
Table 3b

Summary of 100 replications of the alternating current series
circuit phase angle, φ, piecewise cubic fit.

              mi    GCV            PSE            ISE
N = 100:       1    .36 (.057)     .35 (.036)     .27 (.040)
               2    .33 (.059)     .32 (.047)     .25 (.052)
               4    .32 (.059)     .33 (.12)      .26 (.14)
N = 200:       1    .32 (.032)     .31 (.016)     .23 (.017)
               2    .25 (.033)     .24 (.022)     .15 (.025)
               4    .24 (.032)     .24 (.022)     .15 (.070)
N = 400:       1    .30 (.020)     .29 (.007)     .21 (.008)
               2    .22 (.019)     .20 (.011)     .11 (.012)
               4    .21 (.019)     .19 (.012)     .10 (.013)

Table 4

Summary of 100 replications of applying MARS
to purely additive data, Example 3.3.

              mi    GCV            PSE            ISE
N = 50:        1    .30 (.092)     .25 (.053)     .13 (.062)
               2    .34 (.077)     .30 (.074)     .19 (.085)
              10    .34 (.077)     .29 (.080)     .19 (.092)
N = 100:       1    .22 (.035)     .18 (.020)     .053 (.024)
               2    .22 (.040)     .21 (.035)     .081 (.041)
              10    .24 (.041)     .21 (.035)     .088 (.042)
N = 200:       1    .17 (.022)     .16 (.008)     .024 (.009)
               2    .18 (.024)     .17 (.014)     .036 (.016)
              10    .19 (.025)     .17 (.012)     .040 (.015)



4.1.  Constraints. 

The  MARS  procedure  is  nonparametric  in  that  it  attempts  to  model  arbitrary  functions.  It  is 
often  appropriate,  however,  to  place  constraints  on  the  final  model,  dictated  by  knowledge  of  the 
system  under  study,  outside  the  specific  data  at  hand.  Such  constraints  will  reduce  the  variance  of 
the  model  estimates,  and  if  the  outside  knowledge  is  fairly  accurate,  not  substantially  increase  the 
bias.  One  type  of  constraint  has  already  been  discussed  in  Section  3,  namely  limiting  the  maximum 
interaction  order  of  the  model.  One  might  in  addition  (or  instead)  limit  the  specific  variables 
that  can  participate  in  interactions.  If  it  is  known  a  priori  that  certain  variables  are  not  likely 
to  interact  with  others,  then  restricting  their  contributions  to  be  at  most  additive  can  improve 
accuracy.  If  one  further  suspects  that  specific  variables  can  only  enter  linearly,  then  placing  such 
a  restriction  can  improve  accuracy.  The  incremental  charge  d  (15b)  for  knots  placed  under  these 
restrictions should be less than that for the unrestricted knot optimization. (The implementing
software charges 0.8 · df and 0.4 · df, respectively, for the additive and linear constraints where df
is the charge for unrestricted knot optimization.)

These constraints, as well as far more sophisticated ones, are easily incorporated in the MARS
strategy. Before each prospective knot is considered, the parameters of the corresponding potential
new multivariate spline basis function (10) can be examined for consistency with
the constraints. If it is inconsistent, it can simply be marked ineligible for inclusion in the model.
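
The eligibility check described above might look like the following sketch. It is not taken from
the implementing software: the function is_eligible and its arguments are hypothetical, a
candidate basis function is represented simply by the set of variables appearing in its factors,
and it is assumed that a variable enters a given product at most once. Only the interaction-order
and additive-only constraints are shown; a linear-only constraint would additionally restrict
where knots may be placed.

```python
def is_eligible(parent_vars, new_var, max_interaction, additive_only=frozenset()):
    """Return True if a parent basis function may be multiplied by a factor in new_var.

    parent_vars     -- variables already present in the parent basis function
    new_var         -- variable of the prospective new factor
    max_interaction -- the constraint mi on the number of factors
    additive_only   -- variables that may not take part in any interaction
    """
    if new_var in parent_vars:
        return False                      # assumed: one factor per variable
    factors = set(parent_vars) | {new_var}
    if len(factors) > max_interaction:
        return False                      # would exceed the interaction order
    if len(factors) > 1 and factors & set(additive_only):
        return False                      # additive-only variable in a product
    return True
```

For example, with max_interaction = 2 a candidate built on a parent in variable 1 with a new
factor in variable 2 is eligible, unless variable 2 has been declared additive-only.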

4.2.  Semiparametric  Modeling. 

Another kind of a priori knowledge that is sometimes available has to do with the nature of
the dependence of the response on some (or all) the predictor variables. The user may be able to
provide a function g(x_1, ..., x_n) that is thought to capture some aspects of the true underlying
function f(x_1, ..., x_n). More generally, one may have a set of such functions {g_j(x_1, ..., x_n)}_1^J,
each one of which might capture some aspect of the functional relationship. A semiparametric
model of the form

    f_{sp}(x_1, ..., x_n) = \sum_{j=1}^{J} c_j g_j(x_1, ..., x_n) + \hat{f}(x_1, ..., x_n),    (24)

where \hat{f}(x_1, ..., x_n) takes the form of the MARS approximation (9), could then be fit to the data.
The coefficients c_j in (24) are jointly fit along with the parameters of the MARS model. To the
extent that one or more of the g_j successfully describe attributes of the true underlying function,
they will be included with relatively large (absolute) coefficients, and the accuracy of the resulting
(combined) model will be improved.

Semiparametric models of this type (24) are easily fit using the MARS strategy. One simply
includes {g_j(x_1, ..., x_n)}_1^J as J additional predictor variables (x_{n+1}, ..., x_{n+J}) and constrains their
contributions to be linear. One could also, of course, not place this constraint, thereby fitting more
complex semiparametric models than (24).
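
Preparing the augmented predictor set is straightforward. The sketch below is illustrative
only: augment_with_guesses is a hypothetical helper, each user-supplied function is assumed to
accept the full covariate array and return one value per observation, and the returned index
list is simply a record of which columns a fitting routine should constrain to enter linearly.

```python
import numpy as np

def augment_with_guesses(X, guesses):
    """Append g_1(X), ..., g_J(X) as extra columns x_{n+1}, ..., x_{n+J}.

    Returns the augmented array and the indices of the added columns.
    """
    X = np.asarray(X, dtype=float)
    extra = [np.asarray(g(X), dtype=float).reshape(-1, 1) for g in guesses]
    X_aug = np.hstack([X] + extra)
    linear_only = list(range(X.shape[1], X_aug.shape[1]))
    return X_aug, linear_only
```

Dropping the linear restriction on the added columns corresponds to the more complex
semiparametric models mentioned in the text.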

4.3.  Collinearity. 

Extreme  collinearity  of  the  predictor  variables  is  a  fundamental  problem  in  the  modeling  of 
observational data. Solely in terms of predictive modeling it represents an advantage in that it
effectively  reduces  the  dimensionality  of  the  predictor  variable  space.  This  is  provided  that  the 
observed  collinearity  is  a  property  of  the  population  distribution  and  not  an  artifact  of  the  sample 
at  hand.  Collinearity  presents,  on  the  other  hand,  severe  problems  for  interpreting  the  resulting 
model. 

This problem is even more serious for (interactive) MARS modeling than for additive or linear
modeling. Not only is it difficult to isolate the separate contributions of highly collinear predictor
variables to the functional dependence, it is difficult to separate additive and interactive contributions
among them. A highly nonlinear dependence on one such variable can be well approximated
by a combination of functions of several of them, and/or by interactions among them.

In  the  context  of  MARS  modeling  one  strategy  to  cope  with  this  (added)  problem  is  to  fit  a 
sequence  of  models  with  increasing  maximum  interaction  order  (mi).  One  first  fits  an  additive 
model  (mi  =  1),  then  one  that  permits  at  most  two  variable  interactions  (mi  =  2),  and  so  on. 
The  models  in  this  sequence  can  then  be  compared  by  means  of  their  respective  optimizing  GCV 
scores.  The  one  with  the  lowest  mi  value  that  gives  a  (relatively)  acceptable  fit  can  then  be  chosen. 
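
The selection rule in this strategy can be written down compactly. In the sketch below the
notion of a "(relatively) acceptable" fit is made concrete by a relative tolerance on the GCV
score; that tolerance (rel_tol) and the function name are assumptions introduced for
illustration, since the text does not fix a threshold.

```python
def choose_interaction_order(gcv_by_mi, rel_tol=0.05):
    """Pick the smallest mi whose GCV score is within rel_tol of the best score.

    gcv_by_mi maps each maximum interaction order mi to its optimizing GCV score.
    """
    best = min(gcv_by_mi.values())
    for mi in sorted(gcv_by_mi):
        if gcv_by_mi[mi] <= best * (1.0 + rel_tol):
            return mi

# GCV scores reported for the impedance data in Section 3.2.1.
impedance_choice = choose_interaction_order({1: 0.558, 2: 0.231, 4: 0.229})
```

With the impedance scores this rule selects mi = 2, the model shown in Table 2a. The outcome
depends on the tolerance: for the phase angle scores (0.295, 0.219, 0.203) a 5% tolerance keeps
mi = 4 while a 10% tolerance accepts mi = 2.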

4.4.  Robustness. 

Since  the  MARS  method  as  described  here  uses  a  model  selection  criterion  based  on  squared 
error  loss  it  is  not  robust  against  outlying  response  values.  Unlike  linear  regression,  however,  it 
is  not  very  sensitive  to  outliers  in  the  predictor  variable  space,  owing  to  the  local  nature  of  the 
resulting  fit;  sample  covariate  vectors  far  from  an  evaluation  point  tend  to  have  less  rather  than  more 
influence on the model estimate. Response outliers will tend to strongly affect model estimates only
close  to  their  corresponding  covariate  values.  They  will  also  (slightly)  increase  the  variance  of  model 
estimates  elsewhere  by  increasing  the  number  of  multivariate  spline  basis  functions  (required  to 
capture  the  apparent  high  curvature  of  the  function  near  each  outlier). 

There  is  nothing  fundamental  about  squared-error  loss  in  the  MARS  approach.  Any  criterion 
can  be  used  to  select  the  multivariate  spline  basis  functions,  and  construct  the  final  fit,  by  simply 
replacing  the  internal  linear  least  squares  fitting  routine  by  one  that  minimizes  another  loss  criterion 
(given the current set of multivariate spline basis functions). Using robust/resistant regression
methods  would  provide  resistance  to  outliers. 

The  only  advantage  to  squared-error  loss  in  the  MARS  context  is  computational.  It  is  difficult 
to  see  how  rapid  updating  formulae  could  be  developed  for  other  types  of  linear  fitting.  For  those 
with  access  to  rich  computing  environments,  this  presents  no  problem.  For  others,  a  compromise 
strategy  can  mitigate  the  robustness  problem  for  isolated  outliers.  The  multivariate  spline  basis 
functions  are  selected  using  the  standard  MARS  approach  with  least-squares  fitting.  Given  this 
basis, the expansion coefficients {a_m}_0^M (9) are then fit using a robust/resistant linear regression
method to form the final model. This reduces the influence of the response outliers on model
predictions close to their corresponding covariate vectors. It does not remove the (small) increased
variance associated with the additional (now redundant) basis functions.
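
The second stage of this compromise strategy, refitting the coefficients on a fixed basis, can
be sketched as below. The text does not name a particular robust method; Huber-type iteratively
reweighted least squares is used here purely as one familiar choice. The function name, the
tuning constant 1.345, and the MAD-based scale estimate are all assumptions of this sketch, and
B stands for the matrix of selected basis functions evaluated at the data (including the
constant column).

```python
import numpy as np

def huber_coefficients(B, y, delta=1.345, n_iter=50, tol=1e-8):
    """Refit expansion coefficients on a fixed basis matrix B by Huber IRLS."""
    coef, *_ = np.linalg.lstsq(B, y, rcond=None)          # least-squares start
    for _ in range(n_iter):
        resid = y - B @ coef
        # Robust residual scale from the median absolute deviation.
        scale = np.median(np.abs(resid - np.median(resid))) / 0.6745
        scale = max(scale, 1e-12)
        u = np.abs(resid) / scale
        # Huber weights: 1 for small residuals, delta / u for large ones.
        w = np.where(u <= delta, 1.0, delta / np.maximum(u, 1e-12))
        sw = np.sqrt(w)
        new_coef, *_ = np.linalg.lstsq(B * sw[:, None], y * sw, rcond=None)
        converged = np.max(np.abs(new_coef - coef)) < tol
        coef = new_coef
        if converged:
            break
    return coef
```

Observations with large residuals receive weights below one, so isolated response outliers pull
the final coefficients much less than they would under ordinary least squares.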

4.5.  Logistic  Regression. 

Linear logistic regression (Cox, 1970) is often used when the response variable assumes only
two values. The model takes the form

    \log[p / (1 - p)] = \beta_0 + \sum_{i=1}^{n} \beta_i x_i

where p is the probability that y assumes its larger value. The coefficients {β_i}_0^n are estimated
by (numerically) maximizing the likelihood of the data. Recently, Hastie and Tibshirani (1986)
extended this approach to additive logistic regression

    \log[p / (1 - p)] = \sum_{i=1}^{n} f_i(x_i).

The smooth covariate functions are estimated through their “local scoring” algorithm. The model
can be further generalized by

    \log[p / (1 - p)] = \hat{f}(x_1, ..., x_n)

with \hat{f}(x_1, ..., x_n) taking the form of the MARS approximation (9). This is implemented in the
MARS algorithm by simply replacing the internal linear least-squares routine by one that does linear
logistic regression (given the current set of multivariate spline basis functions). Unless rapid
updating formulae can be derived this is likely to be quite computationally intensive. A compromise
strategy analogous to that described in Section 4.4, however, is likely to provide a good approximation;
the multivariate spline basis functions are selected using the squared-error based loss criterion
and the coefficients for the final model are fit using a linear logistic regression on this basis
set. Note that in this setting the least-squares criterion is more robust than the likelihood based
criterion.
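
The final step of the compromise strategy, a linear logistic regression on a fixed basis set,
can be sketched with a plain Newton (iteratively reweighted least squares) iteration. This is
an illustration rather than the implementation referred to in the text: the function name is
hypothetical, B again denotes the matrix of selected basis functions evaluated at the data, y is
coded 0/1, and the small ridge term is added only to keep the linear solve well conditioned.

```python
import numpy as np

def logistic_coefficients(B, y, n_iter=25, ridge=1e-8, tol=1e-10):
    """Maximum likelihood logistic regression of 0/1 responses y on basis matrix B."""
    coef = np.zeros(B.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(B @ coef)))             # fitted probabilities
        w = p * (1.0 - p)                                 # binomial variance weights
        grad = B.T @ (y - p)                              # score vector
        hess = B.T @ (B * w[:, None]) + ridge * np.eye(B.shape[1])
        step = np.linalg.solve(hess, grad)                # Newton step
        coef = coef + step
        if np.max(np.abs(step)) < tol:
            break
    return coef
```

Each Newton step is itself a weighted least-squares problem on the basis functions, which is
why the cost of doing this inside the knot search, rather than once at the end, is a concern.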

4.6.  Reflection  Invariance. 

The MARS procedure as described here is not necessarily invariant to reflections of the individual
predictor variables. Replacing x_i by -x_i can (slightly) change the MARS model. This is due
to the fact that the pure linear term, associated with the piecewise-linear basis on each variable, is
not automatically included in the model; but rather it is subjected to the same forward/backward
stepwise selection strategy as all other potential basis functions. This gives the procedure the ability
to model certain types of dependencies with fewer basis functions than would otherwise be the
case. Also, certain kinds of interaction effects require fewer terms to model than others.

In order to get an idea of the size of this effect a further simulation study was performed on
the alternating current series circuit example (Section 3). Fifteen additional simulation studies
(N = 200, 100 replications each) were done analogous to those that led to Tables 2b and 3b.
For each of the (total) 16 studies, the predictor variables were each multiplied by one of the
16 combinations of (±1, ±1, ±1, ±1). The variance of the ISE over these 16 experiments was
compared to its average variance over the 100 replications of different training sample sets. For the
impedance, this ratio was 0.156 whereas for the phase angle it was 0.036. The higher value for the
impedance is due to the very sharp structure for very low joint values of ω and C (Figure 2, lower
left frame). In both cases, however, the variability in modeling accuracy due to reflections of the
predictor variables is seen to be very small compared to the variability associated with the random
nature of the training data.
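
The bookkeeping for such a study can be sketched as follows. The enumeration of the 16 sign
patterns is unambiguous; the variance ratio, however, reflects one reading of the comparison
described above (variance of the per-pattern mean ISE across the 16 studies, divided by the
average within-study variance over replications), and both function names are hypothetical.

```python
from itertools import product

import numpy as np

def reflections(n_vars=4):
    """All 2**n_vars sign patterns (+1 or -1 per predictor variable)."""
    return [np.array(signs, dtype=float) for signs in product((1.0, -1.0), repeat=n_vars)]

def reflection_variance_ratio(ise):
    """ise has shape (n_patterns, n_replications), one row per sign pattern."""
    ise = np.asarray(ise, dtype=float)
    between = ise.mean(axis=1).var(ddof=1)    # variability due to reflections
    within = ise.var(axis=1, ddof=1).mean()   # variability due to training samples
    return between / within
```

A reflected training set is obtained by multiplying the covariate array by one of the sign
patterns, for example X * reflections()[5], before the model is fit.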

Several  modifications  of  the  MARS  procedure  that  render  it  invariant  under  variable  reflection 
are  currently  under  study.  It  remains  to  be  seen  whether  they  can  provide  approximations  that  are 
as  accurate  as  the  method  described  here. 

4.7.  Low  Dimensional  Modeling. 

The  main  advantage  of  MARS  modeling  over  existing  methodology  is  clearly  realized  in  high 
dimensional settings. It can, however, be competitive in low dimensions (n ≤ 2) as well. Friedman
and Silverman (1987) studied its properties for the smoothing problem (n = 1) and showed that it
can produce superior performance, especially in situations involving small samples and low signal
to noise. These properties should extend to surface modeling (n = 2) as well, although detailed
studies have not yet been performed. Friedman and Silverman (1987) also studied this approach
in the special case of additive modeling (mi = 1). The method was shown to be competitive with
existing  methodology  in  this  application,  again  exhibiting  superior  performance  in  situations  with 
small  samples  and  low  signal  to  noise. 

5.0.  Conclusion. 

The examples and simulation studies indicate that the MARS approach has the potential to
become a useful tool for data modeling. It possesses to some degree the desirable properties of
the recursive partitioning approach; these are its adaptability, automatic variable subset selection,
and ability to exploit low “local” dimensionality. Moreover, it is able to overcome some of recursive
partitioning’s limitations; it produces continuous approximations with continuous derivatives
(if desired); it has additional adaptability to exploit functions with weak high order interactions,
thereby providing better approximations to functions that are nearly linear or additive; and it has
increased interpretability through its ANOVA decomposition that breaks up the approximation
into its additive and various interaction components.

It is important to note that this is a new methodology for which there is, at present, very little
collective experience. Its results should be interpreted with some caution until their reliability is
tested over time in a wide variety of settings. No doubt as such experience is gained useful and
important modifications to this basic approach will become apparent.

A FORTRAN program implementing the MARS methodology described in this report is
available from the author.

Given a likelihood $l(x; \theta)$ and prior density $p(\theta)$, where $x$ and $\theta$
(both typically vector-valued) denote data and unknown parameters,
respectively, the starting point for Bayesian inferences about
$\theta$ is the joint posterior density for $\theta$ given by

\[ p(\theta \mid x) = \frac{l(x; \theta)\, p(\theta)}{\int l(x; \theta)\, p(\theta)\, d\theta}. \qquad (1) \]

In fact, of course, we are usually interested in summaries of the
full joint posterior distribution. For example, attention may be
focussed on univariate marginal densities for some or all of the
components of $\theta$; bivariate joint marginal densities for various
pairs $(\theta_i, \theta_j)$ of component parameters; or even on simpler
summaries in the form of posterior first and second moments.
Alternatively, we may be interested in posterior summaries for
functions of one or more of the components of $\theta$: for example,
marginal and joint densities for $\theta_i/\theta_j$ and $\theta_i\theta_j$.

In all these cases, the technical key to the implementation of
the formal solution given by Bayes' theorem, for specified
likelihood and prior, is the ability to perform a number of integrations.
First, we need to evaluate the denominator of (1) in order to
obtain the normalizing constant of the posterior density; then we
need to integrate over complementary components of $\theta$, or
transformations of $\theta$, in order to obtain marginal (univariate or
bivariate) densities, together with summary moments, highest
posterior density intervals and regions, or whatever. Except in
certain rather stylized problems (for example, exponential families
together with conjugate priors), the required integrations will not
be feasible analytically and so efficient numerical strategies will
be required. Finally, the finite sets of numerical values obtained
after marginalization need to be reconstructed into a graphical
representation of a univariate or bivariate marginal posterior
distribution.

We shall outline numerical integration strategies which have
proved efficient and reliable for problems of this kind. A brief
account will also be given of the techniques used to produce
univariate density curves and contour representations of bivariate
densities. Throughout, we shall provide diagrammatic illustration
of the main ideas.

General accounts of approaches to implementing the Bayesian
paradigm are given in Smith et al. (1985) and Smith et al.
(1987). More specialized technical accounts can be found in
Naylor and Smith (1982) and Shaw (1985, 1986a, 1986b).
Applications of the kinds of techniques described here can be found in
Naylor and Smith (1983), Skene (1983), Skene et al. (1986),
Racine et al. (1986) and Shaw (1987).

We shall first describe an iterative quadrature strategy that has
proved effective for problems involving up to six parameters. It
is well known that univariate integrals of the type

\[ \int_{-\infty}^{\infty} e^{-t^2} f(t)\, dt \qquad (2) \]

are often well-approximated by Gauss-Hermite quadrature rules of
the form

\[ \sum_{i=1}^{n} w_i f(t_i), \qquad (3) \]

where $t_i$ is the $i$th zero of the Hermite polynomial $H_n(t)$. In
particular, if $f(t)$ is a polynomial of degree at most $2n - 1$, then (3)
approximates (2) without error. It follows that, if $h(t)$ is a
suitably well-behaved function and

\[ g(t) = h(t)\,(2\pi\sigma^2)^{-1/2} \exp\left\{ -\tfrac{1}{2} \left( \frac{t - \mu}{\sigma} \right)^2 \right\}, \qquad (4) \]

then

\[ \int_{-\infty}^{\infty} g(t)\, dt \approx \sum_{i=1}^{n} m_i\, g(z_i), \qquad (5) \]

where

\[ m_i = w_i \exp(t_i^2)\, \sqrt{2}\,\sigma, \qquad z_i = \mu + \sqrt{2}\,\sigma\, t_i \qquad (6) \]

(see Naylor and Smith, 1982). We see, therefore, that, expressed
in informal terms, Gauss-Hermite rules are likely to prove very
efficient for functions which closely resemble 'polynomial x
normal' forms. In fact, this is a rather rich class which, even for
moderate $n$ ($\le 11$, say), covers many of the likelihood x prior
shapes we typically encounter for parameters defined on $(-\infty, \infty)$.
Moreover, the applicability of this approximation is vastly
extended by working with suitable transformations of parameters
defined on other ranges, such as $(0, \infty)$ or $(a, b)$, using, for
example, $\log(t)$ or $\log(t - a) - \log(b - t)$, respectively. Of course, to use
(5) we must specify $\mu$ and $\sigma$ in (6). It turns out that, given
reasonable starting values (from any convenient source: prior
information, maximum likelihood estimates, etc.), we can successfully
iterate on (5), substituting into (6) estimates of the posterior mean
and variance obtained using (5) based on previous values of $m_i$
and $z_i$. Moreover, we note that if the posterior density is
well-approximated by the product of a normal and a polynomial of
degree at most $2n - 3$, then an $n$-point Gauss-Hermite rule will
prove effective for simultaneously evaluating the normalizing
constant and the first and second moments, using the same (iterated)
set of $m_i$ and $z_i$. In practice, it is efficient to begin with a small
grid size ($n = 3$ or $n = 4$) and then to gradually increase the grid
size until stable answers are obtained both within and between the
last two grid sizes used.
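
The iteration on (5) and (6) can be sketched in a few lines. The following is a minimal illustration, not the authors' FORTRAN implementation: the function names, the bisection-based computation of the Hermite zeros, the fixed grid size and iteration count, and the test density (an unnormalized normal with mean 2 and standard deviation 0.5) are all assumptions made for the example.

```python
import math

def hermite(n, x):
    """Physicists' Hermite polynomial H_n(x) via H_{k+1} = 2x H_k - 2k H_{k-1}."""
    h_prev, h = 1.0, 2.0 * x
    if n == 0:
        return h_prev
    for k in range(1, n):
        h_prev, h = h, 2.0 * x * h - 2.0 * k * h_prev
    return h

def gauss_hermite(n):
    """Nodes t_i (zeros of H_n) and weights w_i of the n-point rule (3)."""
    roots = []
    for m in range(1, n + 1):
        # Zeros of H_m interlace those of H_{m-1} and lie inside (-bound, bound).
        bound = math.sqrt(2.0 * m + 1.0) + 1.0
        brackets = [-bound] + roots + [bound]
        new_roots = []
        for lo, hi in zip(brackets[:-1], brackets[1:]):
            f_lo = hermite(m, lo)
            for _ in range(100):  # plain bisection on each bracket
                mid = 0.5 * (lo + hi)
                f_mid = hermite(m, mid)
                if (f_mid > 0) == (f_lo > 0):
                    lo, f_lo = mid, f_mid
                else:
                    hi = mid
            new_roots.append(0.5 * (lo + hi))
        roots = new_roots
    norm = 2.0 ** (n - 1) * math.factorial(n) * math.sqrt(math.pi)
    weights = [norm / (n * hermite(n - 1, t)) ** 2 for t in roots]
    return roots, weights

def iterate_quadrature(g, mu, sigma, n=10, iterations=20):
    """Iterate (5)-(6): rebuild the grid from the current mean and sd estimates."""
    t, w = gauss_hermite(n)
    for _ in range(iterations):
        z = [mu + math.sqrt(2.0) * sigma * ti for ti in t]
        m = [wi * math.exp(ti * ti) * math.sqrt(2.0) * sigma for ti, wi in zip(t, w)]
        gz = [g(zi) for zi in z]
        const = sum(mi * gi for mi, gi in zip(m, gz))            # normalizing constant
        mean = sum(mi * zi * gi for mi, zi, gi in zip(m, z, gz)) / const
        second = sum(mi * zi * zi * gi for mi, zi, gi in zip(m, z, gz)) / const
        mu, sigma = mean, math.sqrt(second - mean * mean)
    return const, mu, sigma

def unnormalised_posterior(theta):
    """Hypothetical likelihood x prior: 3 * N(2, 0.5^2) kernel."""
    return 3.0 * math.exp(-0.5 * ((theta - 2.0) / 0.5) ** 2)

const, mean, sd = iterate_quadrature(unnormalised_posterior, mu=1.5, sigma=1.0)
print(const, mean, sd)
```

In practice one would also repeat the calculation for increasing n and compare the answers, as the text recommends; the sketch keeps n fixed for brevity.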

Our discussion so far has been for the one dimensional case.
Clearly, however, the need for an efficient strategy is most acute
in higher dimensions. The 'obvious' extension of the above ideas
is to use a cartesian product rule giving the approximation

\[ \int \cdots \int f(t_1, \ldots, t_k)\, dt_1 \cdots dt_k \approx \sum_{i_k} \cdots \sum_{i_1} m_{i_1}^{(1)} \cdots m_{i_k}^{(k)}\, f\left(z_{i_1}^{(1)}, \ldots, z_{i_k}^{(k)}\right), \qquad (7) \]

where the grid points, $z_{i_j}^{(j)}$, and the weights, $m_{i_j}^{(j)}$, are found from
(6), substituting the iterated estimates of $\mu$ and $\sigma^2$ corresponding
to the marginal component $t_j$.

The problem with this 'obvious' strategy is that the product
form is only efficient if we are able to make an (at least
approximate) assumption of posterior independence among the individual
components.

To overcome this problem, we first apply individual parameter
transformations of the type discussed above, then we attempt to
transform the resulting parameters, via an appropriate linear
transformation, to a new, approximately orthogonal, set of
parameters. At the first step, this linear transformation derives
from an initial guess or estimate of the posterior covariance
matrix (for example, based on the observed information matrix
from a maximum likelihood analysis). Successive transformations
are then based on the estimated covariance matrix from the
previous iteration.

We  are  led  to  the  following  general  strategy. 

1) Reparametrize individual parameters so that the resulting
working parameters all take values on the real line.

2) Using initial estimates of the joint posterior mean vector and
covariance matrix for the working parameters, transform further
to a centred, scaled, more 'orthogonal' set of parameters.

3) Using the derived initial location and scale estimates for these
'orthogonal' parameters, carry out, on suitably dimensioned
grids, cartesian product integration of functions of interest.

4) Iterate, successively updating the mean and covariance
estimates, until stable results are obtained both within and between
grids of specified dimension.
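
Steps 2 to 4 can be illustrated for two parameters as follows. This is a sketch under stated assumptions rather than the authors' code: the Cholesky factor is used as the 'orthogonalizing' linear transformation (one of several possible choices), the 5-point Gauss-Hermite nodes and weights are the standard tabulated values, and the test kernel (a bivariate normal with correlation 0.8) and the starting estimates are hypothetical.

```python
import math

# Standard 5-point Gauss-Hermite nodes and weights (weight function exp(-t^2)).
T = [-2.0201828704560856, -0.9585724646138185, 0.0,
     0.9585724646138185, 2.0201828704560856]
W = [0.01995324205904591, 0.3936193231522412, 0.9453087204829419,
     0.3936193231522412, 0.01995324205904591]

def cholesky2(cov):
    """Lower-triangular L with L L^T = cov, for a 2 x 2 covariance matrix."""
    (a, b), (_, c) = cov
    l11 = math.sqrt(a)
    l21 = b / l11
    l22 = math.sqrt(c - l21 * l21)
    return l11, l21, l22

def product_rule_step(g, mean, cov):
    """Orthogonalise with the current estimates, then apply the product rule (7)."""
    l11, l21, l22 = cholesky2(cov)
    jac = 2.0 * l11 * l22  # Jacobian of theta = mean + sqrt(2) * L * u
    s0, s1 = 0.0, [0.0, 0.0]
    s2 = [[0.0, 0.0], [0.0, 0.0]]
    for ti, wi in zip(T, W):
        for tj, wj in zip(T, W):
            th = (mean[0] + math.sqrt(2.0) * l11 * ti,
                  mean[1] + math.sqrt(2.0) * (l21 * ti + l22 * tj))
            val = wi * wj * math.exp(ti * ti + tj * tj) * jac * g(th)
            s0 += val
            for p in range(2):
                s1[p] += val * th[p]
                for q in range(2):
                    s2[p][q] += val * th[p] * th[q]
    new_mean = [s1[p] / s0 for p in range(2)]
    new_cov = [[s2[p][q] / s0 - new_mean[p] * new_mean[q] for q in range(2)]
               for p in range(2)]
    return s0, new_mean, new_cov

def iterate_product_rule(g, mean, cov, iterations=15):
    """Step 4: repeat, feeding the updated mean and covariance back in."""
    for _ in range(iterations):
        const, mean, cov = product_rule_step(g, mean, cov)
    return const, mean, cov

def kernel(th):
    """Hypothetical unnormalized posterior: N((1, -1), [[1, .8], [.8, 1]]) kernel."""
    x, y = th[0] - 1.0, th[1] + 1.0
    return math.exp(-0.5 * (x * x - 1.6 * x * y + y * y) / (1.0 - 0.64))

const2, mean2, cov2 = iterate_product_rule(
    kernel, mean=[0.7, -0.7], cov=[[1.0, 0.6], [0.6, 1.0]])
print(const2, mean2, cov2)
```

Step 1 (reparametrizing to the real line) is omitted because the test kernel is already defined on the whole plane.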

We now describe an iterative importance sampling strategy
which has proved effective in higher dimensions. The importance
sampling approach to numerical integration is based on the
observation that, if $f$ and $g$ are density functions,

\[ \int f(x)\, dx = \int \left[ f(x)/g(x) \right] g(x)\, dx = \int \left[ f(x)/g(x) \right] dG(x) = E_G\left[ f(X)/g(X) \right], \]

which suggests the 'statistical' approach of generating a sample
from the distribution $G$ and using the average of the values of the
ratio $f/g$ as an unbiased estimator of $\int f(x)\, dx$. However, the
variance of such an estimator clearly depends critically on the
choice of $G$, it being desirable to choose $g$ to be 'similar' to $f$.
In the univariate case, if we choose $g$ to be heavier-tailed than
$f$, and if we work with $Y = G(X)$, the required integral is the
expected value of $f[G^{-1}(Y)]/g[G^{-1}(Y)]$ with respect to a uniform
distribution on the interval (0, 1). Owing to the periodic nature of
the ratio function over this interval, we are likely to get a
reasonable approximation to the integral by simply taking some equally
spaced set of points on (0, 1), rather than actually generating
'uniformly distributed' random numbers. If $f$ is a function of more
than one argument ($k$, say), an exactly parallel argument suggests
that the choice of a suitable $g$ followed by the use of a suitably
'uniform' configuration of points in the $k$-dimensional unit
hypercube will prove an acceptable alternative to the 'costly' procedure
of generating 'random' uniformly distributed points in $k$
dimensions.
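
The univariate version of this idea is short enough to write out. The sketch below is only an illustration under assumptions of our own: the importance density g is taken to be the standard logistic (heavier-tailed than the normal test integrand), the equally spaced points are the midpoints (k - 1/2)/N, and the function names are hypothetical.

```python
import math

def logistic_inverse_cdf(u):
    """G^{-1}(u) for the standard logistic distribution."""
    return math.log(u / (1.0 - u))

def logistic_density(x):
    """g(x) for the standard logistic distribution (written to avoid overflow)."""
    e = math.exp(-abs(x))
    return e / (1.0 + e) ** 2

def integrate_by_importance(f, n_points=200):
    """Average f/g at x = G^{-1}(u) over equally spaced u in (0, 1)."""
    total = 0.0
    for k in range(1, n_points + 1):
        u = (k - 0.5) / n_points
        x = logistic_inverse_cdf(u)
        total += f(x) / logistic_density(x)
    return total / n_points

# Test integrand: the N(0, 1) kernel, whose integral is sqrt(2 * pi).
estimate = integrate_by_importance(lambda x: math.exp(-0.5 * x * x))
print(estimate, math.sqrt(2.0 * math.pi))
```

Replacing the equally spaced points by pseudo-random uniforms gives the usual Monte Carlo importance sampling estimator, at the cost of sampling noise.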

However, the effectiveness of all this depends on choosing a
suitable $G$, bearing in mind that we need to have available a
flexible set of possible distributional shapes, for which $G^{-1}$ is
available explicitly. In the univariate case, such a family defined on
$\mathbb{R}$ is provided by considering the random variable

\[ X_A = A\,h(U) - (1 - A)\,h(1 - U), \]

where $U$ is uniformly distributed on (0, 1), $h : (0, 1) \to \mathbb{R}$ is a
monotone increasing function such that $\lim_{u \to 0} h(u) = -\infty$, and
$0 \le A \le 1$ is a constant. The choice $A = 0.5$ leads to symmetric
distributions; as $A \to 0$ or $A \to 1$ we obtain increasingly skew
distributions (to the left or right). The tail-behaviour of the
distribution is governed by the choice of the function $h$. Thus, for
example, $h(u) = \log(u)$ leads to a family whose symmetric
member is the logistic distribution; $h(u) = -\tan[\tfrac{1}{2}\pi(1 - u)]$ leads
to a family whose symmetric member is the Cauchy distribution.
Moreover, the moments of the distribution are polynomials in $A$
(of corresponding order), the median is linear in $A$, etc., so that
sample information about such quantities provides (for any given
choice of $h$) operational guidance on the appropriate choice of $A$.
For details, see Shaw (1986a). To use this family in the
multiparameter case, we again employ individual parameter
transformations, so that all parameters belong to $\mathbb{R}$, together with
'orthogonalizing' transformations, so that parameters can be treated
'independently'. In the transformed setting, it is natural to
consider an iterative importance sampling strategy which attempts to
learn about an appropriate choice of $G$ for each parameter.
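
A small sketch may help fix the construction of this family. It assumes the two choices of h quoted above (the tangent form is our reading of a garbled formula in the source) and uses hypothetical function names; it simply evaluates X_A at equally spaced values of U and checks the stated symmetric members and the linearity of the median in A.

```python
import math

def h_log(u):
    """h(u) = log(u): the symmetric member (A = 0.5) is logistic-shaped."""
    return math.log(u)

def h_tan(u):
    """h(u) = -tan(pi * (1 - u) / 2): the symmetric member is Cauchy."""
    return -math.tan(0.5 * math.pi * (1.0 - u))

def x_a(u, a, h):
    """Quantile transform X_A = A h(U) - (1 - A) h(1 - U), with 0 <= A <= 1."""
    return a * h(u) - (1.0 - a) * h(1.0 - u)

def equispaced(n):
    """Midpoints (k - 1/2)/n of n equal cells in (0, 1)."""
    return [(k - 0.5) / n for k in range(1, n + 1)]

# Median (U = 0.5) and mean of X_A for a few values of A, with h = log.
for a in (0.2, 0.5, 0.8):
    values = [x_a(u, a, h_log) for u in equispaced(2000)]
    print(a, x_a(0.5, a, h_log), sum(values) / len(values))
```

With h = log, the mean works out to 1 - 2A and the median to (2A - 1) log(1/2), both linear in A, so sample location summaries translate directly into a choice of A.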

As we remarked earlier, part of this strategy requires the
specification of 'uniform' configurations of points in the
$k$-dimensional unit hypercube. This problem has, in fact, been
extensively studied by number theorists, and systematic
experimentation with various suggested forms of 'quasi-random'
sequences has identified effective forms of configuration for
importance sampling purposes. For details, see Shaw (1986b).
The general strategy is then the following.

1) Reparametrize individual parameters so that the resulting
working parameters all take values on the real line.

2) Using initial estimates of the joint posterior mean vector and
covariance matrix for the working parameters, transform further
to a centred, scaled, more 'orthogonal' set of parameters.

3) In terms of these transformed parameters, set
$g(\theta) = \prod_{j=1}^{k} g_j(\theta_j)$ for 'suitable' choices of $g_j$, $j = 1, \ldots, k$.

4) Use the inverse distribution function transformation to reduce
the problem to that of calculating an average over a 'suitable'
uniform configuration in the $k$-dimensional hypercube.

5) Use information from this 'sample' to learn about skewness,
tailweight, etc. for each $g_j$, and hence choose 'better' $g_j$,
$j = 1, \ldots, k$, as well as revising estimates of the mean vector
and covariance matrix.

6) Iterate until the sample variance of replicate estimates of the
integral value is sufficiently small.

In reconstructing either a univariate density or the contours of a
bivariate density, we begin with a set of density values at some
set of parameter values. In the context of the product rule
quadrature approach, the parameter values will correspond to grid points
selected by a quadrature rule. In the context of importance
sampling, the resulting configuration of spot-heights would typically
be too irregular for efficient graphical reconstruction and so a
mixed strategy is adopted, using a quadrature approach for the
parameters of interest and sweeping out the others by importance
sampling.

In either case, the approach we adopt for the parameters of
interest is to fit splines to the logarithms of the density values.
For univariate reconstruction we use 'not-a-knot' cubic splines;
for contouring, we use tensor product splines. See also Smith et
al. (1985) and, for a much more detailed account, Shaw (1985).

The strategies outlined in this paper depend heavily on the
availability of interactive computing facilities with graphics
capabilities. At the time of writing (and for the foreseeable future),
rapid changes are taking place in both technical and economic
aspects of the availability of appropriate computing environments.
The direction of these changes will clearly influence the form in
which Software for Practical Bayesian Statistics will be packaged
and marketed, and this applies, in particular, to the software
relating to these strategies. For the present, anyone interested in
obtaining some form of this software should contact the author.

0. Abstract

In a computational experiment, the data are produced by a
computer program that models a physical system. The experiment
consists of a set of model runs; the design of the experiment specifies
the choice of program inputs for each run. This paper centers on the
problem of prediction (interpolation), the goal of which is to devise a
design/analysis method which will provide predictions of model output
for input values not run. We adopt a Bayesian approach as the basis
for the analysis. Uncertainty about the response function is quantified
by choosing a class of probability distributions over the function space.
This leads to design procedures based on maximizing the expected
reduction in "amount of uncertainty", where the latter can be defined
formally in terms of properties of the posterior distribution. Here we
use as a design optimality criterion the determinant of the posterior
covariance matrix of the responses at the input configurations at which
we want to make predictions. This requires maximization of the
determinant of the prior covariance matrix of the responses at the
design sites. We describe our computer algorithm for constructing
optimal designs, and give some examples of designs that it produces.

1. Introduction

1.1 Computer models and computational experiments.

There  is  widespread  and  growing  use  of  computer  models  as  tools 
in  scientific  research.  As  surrogates  for  physical  or  behavioral  systems, 
computer  models  can  be  subjected  to  experimentation,  the  goal  being  to 
predict  how  the  corresponding  real  system  would  behave  under  certain 
conditions.  This  paper  is  motivated  by  the  goal  of  getting  information 
from  computer  models  as  efficiently  as  possible. 

Here we regard a computer model as a computer program that maps
a vector of input variables (parameters) $t$ into a vector of output
variables $y$, where $t$ and $y$ are physically meaningful. We view $y$ as a
function $y(t)$ over some domain $T$ in the space of the input variables.
This function is deterministic: if the program is run twice (on the same
computer) with the same value of $t$, the same value of $y$ will result.

We consider a computational experiment to be a collection of runs
of the computer model, made for the purpose of investigating $y(t)$ for
$t \in T$. For convenience, we shall consider $T$ to be defined only by the
design variables, i.e., those variables that are changed during the
course of the experiment. In a typical experiment of $n$ runs, the $i$th
computer run is made using inputs $t_i \in T$, $i = 1, 2, \ldots, n$; this collection
of input configurations is called the experimental design.

There are several important general classes of problems that can be
approached through computational experiments, e.g., prediction,
sensitivity analysis, uncertainty analysis, optimization, root finding,
and integration of output. Perhaps the most fundamental is the problem
of prediction of $y(t)$ at sites $t$ that have not been directly observed.
The design of experiments for this purpose is the subject of this paper.

We consider a solution to the prediction problem to include a
prediction equation $\hat{y}(t)$, formulas for evaluating the uncertainty of
prediction, and rules for choosing the design sites. Because of the
nature of our approach, which is described below, our method is quite
similar to interpolation, in that the prediction of $y$ will be identical to
the observed $y$ at values of $t$ for which the model has been run. At
other values of $t$, our prediction will take the form of a probability
distribution, the mean of which, expressed as the function $\hat{y}(t)$, can be
used as a prediction equation.

We approach the problem from a Bayesian point of view, under
which uncertainty about the function $y$ is expressed by means of a
probability distribution over all possible response functions. Random
functions (stochastic processes, random fields) have been used as
models in kriging and other spatial applications for a long time, not
generally in an overt Bayesian sense, however. The prediction problem
in spatial settings is usually formulated as the problem of making
inferences about the realization of a spatial stochastic process $Y(t)$,
given the values of that process at a set of "sites" $t_1, \ldots, t_n$. See
Ylvisaker (1987) for a discussion of problems of this general type and
of the associated design problems. Recently, Shewry and Wynn (1986)
and Sacks and Schiller (1987) proposed and used design optimality
criteria based on spatial stochastic process models to compute optimal
designs for prediction in various settings. Kimeldorf and Wahba (1970)
were the first, as far as we know, to use a stochastic process in an
explicitly Bayesian sense, for the purpose of predicting a fixed but
unknown function. Only recently has there emerged an interest in
applying stochastic process models to the design and analysis of
computational experiments (Sacks, Schiller, and Welch, 1988).

In  this  paper,  we  shall  focus  on  the  problem  of  designing 
computational  experiments  for  prediction.  We  present  our  approach  to 
prediction,  given  a  design,  briefly  in  Section  2.  In  Section  3,  we 
describe  a  design  criterion  and  our  algorithm  for  constructing  designs 
that are optimal with respect to it.

2. Prediction

We represent "knowledge" about the unknown function $y(t)$ by a
stochastic process $Y(t)$, where

(P1). $Y(t)$ has a normal distribution with mean $\mu$ and variance $\sigma^2$ (the
same for all $t$), and

(P2). For any pair of sites $t \in T$, $s \in T$, the correlation between $Y(t)$ and
$Y(s)$ is a function only of the vector of differences $d = t - s$, i.e.,

\[ \rho_{ts} = \mathrm{Corr}(Y(t), Y(s)) = R(t - s) = R(d), \qquad (2.1) \]

where $R(d) = R(-d)$ and $R(0) = 1$.

The posterior distribution of $Y$ on any finite set in $T$, given the set
of observed responses $y(D)$ on the set of design sites $D$, is easily
obtained as a conditional multivariate normal distribution.

Let

\[ C_D = \mathrm{Corr}(Y(D), Y(D)) \]

be the $n \times n$ matrix whose elements are the prior correlations between
the responses at all pairs of design sites. Let

\[ r_D(t) = \mathrm{Corr}(Y(t), Y(D)) \]

be the $n$-vector of prior correlations between $Y(t)$ and $Y(D)$.

Then the posterior distribution of $Y(t)$ is normal with mean

\[ \mu_{t|D} = \mu + r_D'(t)\, C_D^{-1} (y_D - \mu J) \qquad (2.2) \]

and variance

\[ \sigma^2_{t|D} = \sigma^2 \left[ 1 - r_D'(t)\, C_D^{-1}\, r_D(t) \right], \qquad (2.3) \]

where $J$ in (2.2) is an $n$-vector of 1's, and $y_D$ is the set of observed
responses $y(D)$ written as a vector. The posterior covariance of $Y(t)$
and $Y(s)$ is

\[ \sigma_{ts|D} = \sigma^2 \left[ R(t - s) - r_D'(t)\, C_D^{-1}\, r_D(s) \right]. \qquad (2.4) \]

All knowledge about $y(t)$ given the data and the prior process is
embodied in the posterior process defined by (2.2)-(2.4), which is
Gaussian like the prior process, but is no longer stationary. Since we
shall use the posterior process for prediction, we shall often refer to it
as the "predictive process". The mean of this process (2.2), viewed as a
function of $t$, can be taken as $\hat{y}(t)$. This is an interpolating function,
since it passes through the observed $y$'s. The posterior variance (2.3)
can be used as a measure of uncertainty of prediction at site $t$. It is
necessarily 0 at the observed sites.
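
The posterior mean (2.2) and variance (2.3) are simple to compute once a correlation function is fixed. The sketch below is an illustration only: it assumes a scalar input, a correlation of the form rho**(d*d) with rho = 0.5, and made-up design sites and responses; the function names are hypothetical and a small Gaussian elimination routine stands in for a linear algebra library.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x

def corr(t, s, rho=0.5):
    """Assumed correlation R(d) = rho ** (d * d) for scalar inputs."""
    return rho ** ((t - s) ** 2)

def predict(t, sites, y, mu, sigma2):
    """Posterior mean (2.2) and variance (2.3) of Y(t) given y(D)."""
    c = [[corr(a, b) for b in sites] for a in sites]   # C_D
    r = [corr(t, a) for a in sites]                    # r_D(t)
    alpha = solve(c, [yi - mu for yi in y])            # C_D^{-1} (y_D - mu J)
    beta = solve(c, r)                                 # C_D^{-1} r_D(t)
    mean = mu + sum(ri * ai for ri, ai in zip(r, alpha))
    var = sigma2 * (1.0 - sum(ri * bi for ri, bi in zip(r, beta)))
    return mean, var

sites = [0.0, 0.5, 1.0]          # hypothetical design D
responses = [1.0, 2.0, 0.5]      # hypothetical model outputs y(D)
for t in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(t, predict(t, sites, responses, mu=1.0, sigma2=1.0))
```

At the design sites the printed mean reproduces the observed response and the variance is zero up to rounding, which is the interpolation property described above.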


3.  Design 

3.1 Design Criterion

Suppose we want to design an experiment in $n$ runs for prediction
at a finite set of $n'$ sites $T' \subset T$, where $n' > n$. After the experiment is
run, knowledge of $y$ at these sites will be embodied in the
$n'$-dimensional normal distribution of $Y(T' \mid D)$ generated by the
predictive process there. The mean $\mu_{T'|D}$ and the covariance matrix
$\Sigma_{T'|D}$ of this distribution can be obtained using (2.2)-(2.4).

We shall adopt as our design criterion the minimization of the
determinant of $\Sigma_{T'|D}$. We refer to this criterion as D-optimality
because, like the usual D-optimality criterion in the linear model
setting, its goal is to minimize the posterior generalized variance of the
unknowns that one is trying to estimate. Shewry and Wynn (1986) have
shown that this is equivalent to maximizing the expected gain in
information (Lindley, 1956), where information is measured by
Shannon's entropy. Shewry and Wynn also showed that this is
equivalent to maximizing the determinant of $C_D$.

Given a correlation function, a D-optimal design can, in principle,
be found before any data on $y$ are taken, since the optimality criterion
does not depend on $y$. Except in a few special cases, however, there
seem to be few theoretical results available for finding such designs.
The designs constructed for this paper were obtained from a computer
algorithm adapted from DETMAX (Mitchell, 1974), which was first
developed for the purpose of constructing D-optimal designs for linear
regression. The optimization method is based on a series of
"excursions", which are sequences of designs in which each design
differs from its predecessor by the presence or absence of a single site.
The first and last designs in an excursion have $n$ sites; the intermediate
designs all have fewer sites. (This restriction to designs with $n$ or fewer
sites was put in to avoid numerical problems associated with the nearly
singular $C_D$ matrices that sometimes arose when the number of sites
became large. It ensures that $C_D$ for any design $D$ encountered during
the excursion is at least as well conditioned as the starting design.)

The first step of each excursion removes a site from the best current
design. At subsequent steps, a site is added, unless the design at that
step has already been declared a "failure design", in which case a site is
removed. (All designs encountered since the most recent successful
excursion are designated as failure designs.) For the purpose of
checking a design for equivalence to a failure design, only the
determinants of their correlation matrices are compared; thus false
equivalence may occasionally be declared. All additions and deletions
are made with the goal of maximizing the determinant of the
correlation matrix for the resulting design. By this criterion, the best
site $t$ to add to an existing design $D$ is the one at which the variance
function $\sigma^2_{t|D}$ is greatest. It can also be shown that the largest
determinant after deletion of a site in $D$ can be achieved by choosing
that site to be the one associated with the greatest element of the
diagonal of $C_D^{-1}$.
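
Both rules follow from standard determinant identities: adding a site $t$ multiplies $|C_D|$ by $1 - r_D'(t) C_D^{-1} r_D(t)$, and deleting site $i$ multiplies it by the $i$th diagonal element of $C_D^{-1}$. The sketch below checks these identities numerically. It is not DETMAX; the correlation function rho ** (squared distance), the four-point design, and the function names are assumptions chosen for the illustration.

```python
def det_and_inverse(a):
    """Determinant and inverse of a square matrix by Gauss-Jordan elimination."""
    n = len(a)
    m = [row[:] + [1.0 if i == j else 0.0 for j in range(n)]
         for i, row in enumerate(a)]
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if pivot != col:
            m[col], m[pivot] = m[pivot], m[col]
            det = -det
        p = m[col][col]
        det *= p
        m[col] = [v / p for v in m[col]]
        for r in range(n):
            if r != col:
                f = m[r][col]
                m[r] = [vr - f * vc for vr, vc in zip(m[r], m[col])]
    return det, [row[n:] for row in m]

def corr(s, t, rho=0.5):
    """Assumed correlation: rho raised to the squared Euclidean distance."""
    return rho ** sum((a - b) ** 2 for a, b in zip(s, t))

def corr_matrix(sites):
    return [[corr(s, t) for t in sites] for s in sites]

def variance_factor(t, sites):
    """1 - r' C^{-1} r, i.e. the variance function (2.3) divided by sigma^2."""
    _, inv = det_and_inverse(corr_matrix(sites))
    r = [corr(t, s) for s in sites]
    n = len(sites)
    return 1.0 - sum(r[i] * inv[i][j] * r[j] for i in range(n) for j in range(n))

design = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]   # hypothetical D
candidate = (0.5, 0.5)

det_d, inv_d = det_and_inverse(corr_matrix(design))
det_added, _ = det_and_inverse(corr_matrix(design + [candidate]))
det_deleted, _ = det_and_inverse(corr_matrix(design[1:]))

print(det_added, det_d * variance_factor(candidate, design))   # addition rule
print(det_deleted, det_d * inv_d[0][0])                        # deletion rule
```

Each printed pair should agree up to rounding, which is why maximizing the variance function (for additions) and the diagonal of the inverse (for deletions) maximizes the determinant at each step.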

The search for the best site to add, i.e., the $t$ at which the variance
function (2.3) is maximized, is conducted over a grid in $T$. Except
when $T$ has few dimensions or the grid is very coarse, it is not practical
to make the search exhaustive. Instead we have incorporated a
multiple search procedure that can best be envisioned by thinking of a
set of $n$ hikers trying to climb a hill. Each hiker starts at one of the $n$
current design sites; at each of these, the variance function is zero. The
algorithm proceeds by stages, where in each stage, each hiker takes one
step in the direction that allows him to increase his altitude the most.
We restrict him to consider only the $2k$ neighboring grid points
associated with a change in exactly one of the $k$ design variables, and
of course we don't let him step outside of $T$. Under this procedure, the
variance function (2.3) is evaluated at (at most) $2nk$ sites in each stage.
Sometimes, two hikers will merge, in which case they continue as one.
The search ends when all hikers have stopped at (local) maxima; the
site that corresponds to the largest of these is taken to be the best site to
bring into the design at the current point in the excursion.
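
The "hikers" procedure is a multi-start steepest ascent on a grid. A minimal sketch follows; it is our own illustration rather than the authors' code. In the design algorithm the score would be the variance function (2.3) and the starting points the current design sites; here a hypothetical two-peaked score on a 21 x 21 integer grid is used so that the example is self-contained.

```python
import math

def hikers_search(score, starts, grid_size, k):
    """Multi-start steepest ascent on the integer grid {0, ..., grid_size - 1}^k."""
    hikers = set(tuple(s) for s in starts)
    while True:
        moved = False
        next_positions = set()  # hikers that land on the same point merge
        for pos in hikers:
            best, best_val = pos, score(pos)
            for axis in range(k):           # the 2k axis-aligned neighbours
                for step in (-1, 1):
                    cand = list(pos)
                    cand[axis] += step
                    if 0 <= cand[axis] < grid_size:   # never step outside T
                        val = score(tuple(cand))
                        if val > best_val:
                            best, best_val = tuple(cand), val
            if best != pos:
                moved = True
            next_positions.add(best)
        hikers = next_positions
        if not moved:                        # every hiker is at a local maximum
            break
    return max(hikers, key=score)

def toy_score(p):
    """Hypothetical surface with a local peak at (4, 4) and a higher one at (15, 14)."""
    x, y = p
    return (math.exp(-((x - 4) ** 2 + (y - 4) ** 2) / 8.0)
            + 2.0 * math.exp(-((x - 15) ** 2 + (y - 14) ** 2) / 8.0))

best_site = hikers_search(toy_score, starts=[(0, 0), (20, 20), (0, 20)],
                          grid_size=21, k=2)
print(best_site)
```

As the text notes, the procedure finds only local maxima; using several hikers makes it more likely, but not certain, that the highest peak is among them.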


The number of excursions made during each search ("try") is
determined by restricting the maximum allowed deviation from the
nominal number of runs ($n$), the maximum allowed number of
successive excursions that fail to improve $|C_D|$, and the maximum
allowed number of "failure designs". (We generally set these
restrictions to 4, 10, and 20, respectively.) When one of these
constraints causes the search to end, a check for local optimality is
made by removing each design site in turn and attempting to replace it
by another, using the "hikers" algorithm. If the latter succeeds in
finding the global maximum of the variance function in each case, then
$D$ is locally optimal in the sense that $|C_D|$ cannot be increased by moving
a single site. However, the success of the "hikers" algorithm is not
guaranteed, and even if it were, the search would not necessarily
produce a global optimum.

Table 1 gives an example of a design (on a grid in the
5-dimensional unit hypercube) generated by our algorithm for the case
$n = 6$, $k = 5$, under a "product linear" correlation function:

\[ \mathrm{Corr}(Y(t), Y(s)) = R(d) = \prod_{j=1}^{k} \left( 1 - (1 - \rho_j)\, |d_j| \right), \]

where $d_j = t_j - s_j$ and $\rho_j = 0.99$, $j = 1, \ldots, k$. (When generating
designs in the absence of previous data, we usually choose the
correlation function to be a product of identical one-dimensional
correlation functions.)


Table 1. Allegedly D-optimal
design in 5 variables and 6 runs.

Site No.   t_1   t_2   t_3   t_4   t_5
   1       0.0   0.0   0.0   0.0   0.0
   2       0.6   0.0   1.0   1.0   0.0
   3       0.0   0.6   1.0   0.0   1.0
   4       1.0   1.0   0.6   0.0   0.0
   5       1.0   0.0   0.0   0.6   1.0
   6       0.0   1.0   0.0   1.0   0.6

This design exhibits some interesting geometrical structure. Each
of the sites in set A = {2, 3, 4, 5, 6} is at distance 2.8 from its two
nearest neighbors in A and at distance 3.2 from its two most distant
neighbors in A, and each site in A is at distance 2.6 from site 1. (Here
"distance" is measured along the grid.) Because of the high value of $\rho$,
there is a large region in the middle of $T$ in which there are no design
sites; predictions here rely heavily on information from the surrounding
design sites. This characteristic is even more pronounced for smoother
correlation functions. If we use the cubic correlation:

nil  ri 

/=i 

with $a$ and $b$ chosen so that, if $s$ and $t$ are at opposite corners of the
5-cube $T$, $\mathrm{Corr}(Y(t), Y(s)) = 0.99^5$, all six sites
in the optimal design are on corners of $T$. In fact, this design turns out
to be equivalent to the D-optimal first order regression design in 5
factors and 6 runs (Galil and Kiefer, 1980).
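
The geometric structure described for the design in Table 1 is easy to verify. The short check below takes the coordinates as we read them from Table 1, measures distance along the grid (the sum of absolute coordinate differences), and also evaluates the product linear correlation with rho = 0.99; the function names are ours.

```python
design = {
    1: (0.0, 0.0, 0.0, 0.0, 0.0),
    2: (0.6, 0.0, 1.0, 1.0, 0.0),
    3: (0.0, 0.6, 1.0, 0.0, 1.0),
    4: (1.0, 1.0, 0.6, 0.0, 0.0),
    5: (1.0, 0.0, 0.0, 0.6, 1.0),
    6: (0.0, 1.0, 0.0, 1.0, 0.6),
}

def grid_distance(a, b):
    """Distance measured along the grid: sum of absolute differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def product_linear_corr(a, b, rho=0.99):
    """Product linear correlation: prod_j (1 - (1 - rho) |d_j|)."""
    value = 1.0
    for x, y in zip(a, b):
        value *= 1.0 - (1.0 - rho) * abs(x - y)
    return value

set_a = [2, 3, 4, 5, 6]
for i in set_a:
    within = sorted(round(grid_distance(design[i], design[j]), 1)
                    for j in set_a if j != i)
    to_origin = round(grid_distance(design[i], design[1]), 1)
    print(i, within, to_origin, product_linear_corr(design[i], design[1]))
```

Every site in A should show two neighbors at 2.8, two at 3.2, and distance 2.6 to site 1, in agreement with the description in the text.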

At the other extreme, designs that infiltrate $T$ to a greater extent can
be constructed by using correlation functions $R(d)$ that decrease
rapidly with $|d|$. For example, consider the correlation function:

\[ R(d) = \prod_{j=1}^{k} \rho^{|d_j|}. \]

With $\rho = 0.0001$, the best 16-run design (on a grid in the unit
square) produced by our algorithm in 10 tries is shown in Figure 1. All
ten tries gave slightly different determinant values, so it is unlikely that
this design is truly optimum. There seemed to be little point in
undertaking more tries, however, since the computing time per try was
about 45 seconds on a Cray X-MP. We did try various grid sizes, to
avoid penalizing ourselves by choosing too coarse a grid. We found
that 20 x 20 was sufficient; finer grid sizes require increasingly longer
computation time with little apparent benefit.

We favor this kind of design as an initial design in a stagewise
approach, in which the correlation function that is used to generate the
design sites at each stage may change during the course of the
experiment. Methods for selecting the correlation function via
cross-validation are discussed in Currin, Mitchell, Morris, and Ylvisaker
(1988); some applications to computer models are also given there.

Figure 1. Best 16-run design found on a 20 x 20 grid,
using an exponential correlation function with $\rho = 0.0001$.

ABSTRACT

Additive Principal Components are a generalization of linear principal components, where the
usual linear function, $\sum_i a_i X_i$, defining the linear
principal component, is replaced by a possibly non-linear additive function,
$\sum_i \phi_i(X_i)$, to form an additive principal component. We investigate
the analogy to the smallest linear principal component. We present two approaches to
estimation: a finite dimensional method, based
on a matrix eigen decomposition, and an iterative
algorithm, based on a componentwise minimization scheme.

The smallest additive principal component
describes nonlinear structure in a
high-dimensional space. Consequently it is difficult
to interpret the estimated functions in terms
that are meaningful for the data analyst. For the
additive principal component, the task of interpretation
is almost intractable without tools for
real time graphical interaction. With these tools,
a pleasingly direct method for interpretation of
the functions in terms of the original variables is
possible.

1  INTRODUCTION 

In this paper we investigate the additive analogue
to the smallest principal component, that
is, we estimate additive functions from multivariate
data which satisfy as nearly as possible the
constraint:

$$\sum_{j=1}^{p} \phi_j(X_j) = 0.$$

Such an additive constraint describes high-dimensional
structure in the data. Recall the
linear structure implied by a linear constraint,
$l(x) = a \cdot x = 0$. If the data nearly satisfy
this constraint, they lie close to a linear manifold
of dimension $p - 1$, that is, of co-dimension 1. Analogously, an additive
constraint defines an additive manifold of
co-dimension 1, and data nearly satisfying this
constraint lie near this additive manifold.

Research Group, Bellcore, 445 South Street, Morristown,

Estimation of constraints is an appropriate
analysis tool when the search for structure in the
data is undirected, that is, no variables are designated
a priori as predictors of a response of
interest. Hence it is a valuable exploratory tool
for investigating dependencies in multivariate observational
data, where variables are usually interdependent.

Additive principal components were first considered
in the context of detecting instability in
the additive regression model. The importance
of recognizing nonlinear dependencies among the
predictor variables when fitting additive regression
models is analogous to the importance of
detecting collinearity patterns when fitting linear
models (Silvey 1969). Suppose we were to
fit an additive model $Y \approx \sum_{j=1}^{p} \theta_j(X_j)$ to
data, when there is an exact concurvity between
the predictors, that is, there are functions of the
variables such that $\sum_{j=1}^{p} \phi_j(X_j) = 0$. In this situation,
the alternative fit:

$$Y \approx \sum_{j=1}^{p} (\theta_j + \phi_j)(X_j)$$

is indistinguishable from the initial one. While
exact concurvity is unlikely, even if the data come
close to satisfying this constraint, some or all of
the estimated $\theta_j$ are likely to be unstable. A
method which enables us to examine how close
the data come to satisfying an additive constraint
would thus be a diagnostic check for global stability
of the transforms in additive or ACE regression.

Additive principal component analysis is
closely related to multiple correspondence analysis
(Benzecri 1972, Gifi 1981), and to the non-linear
principal components of De Leeuw (1982a),
both of which consider largest principal components
of a transformation of the variables. These
techniques have been developed and used primarily
with psychometric data; their relationship to
APC analysis is discussed in Section 4.

Following the formal definition of the APC, we
give a brief derivation of its characterization as
an eigenfunction of a compact operator in Section
3. In Section 5 we discuss methods of estimation.

The final sections are of a more practical nature,
concerned with using APC analysis as an
applied method: Section 6 discusses interpretation
techniques which we use to interpret the
smallest APC of a data set in Section 7.

2  THE  POPULATION  ADDITIVE  PRINCIPAL  COMPONENT

2.1  Motivation 

Strong additive dependence in a set of variables
exists if the data can be transformed so that
$X_1, X_2, \ldots, X_p$ come close to satisfying an additive
constraint $\sum_i \phi_i(X_i) = 0$. Our objective is to
characterize the set of unknown transformations
$\phi_1, \ldots, \phi_p$.

When the transformations are restricted to be
linear, we simply have the classical problem of
the analysis of collinearity. The simple rationale
that a linear combination of the variables with
variance near zero implies the variables are nearly
collinear, leads to the criterion:

$$\min_a \operatorname{var}\Big(\sum_i a_i X_i\Big) \quad \text{subject to} \quad \sum_i a_i^2 = 1.$$

The variables are usually standardized, somewhat
arbitrarily, to have $E(X_i) = 0$ and
$\operatorname{var}(X_i) = 1$.

The minimum occurs for $a$ an eigenvector for
the smallest eigenvalue of $\operatorname{cov}(X) = \Sigma$, and the
random variable $\sum_i a_i X_i$ is known as a smallest
principal component of X. A geometric characterization
of the smallest principal component
comes from observing that the linear function,
$l(x) = a \cdot x$, of the minimizing vector $a$, defines
a linear manifold of co-dimension 1 in p-space
through $l(x) = 0$. This linear manifold minimizes
the expected squared distance from the observations
to any linear manifold of co-dimension
1.

In short, there are three characterizations of
the vector $a$ defining the principal component:

•  $\sum_i a_i X_i$ has minimal variance among all
linear combinations of the variables with
$\sum_i a_i^2 = 1$.

•  $a \cdot x = 0$ defines the manifold of co-dimension 1
minimizing expected squared
distance to the data.

•  $a$ is an eigenvector for the smallest eigenvalue
of $\Sigma$.
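
The equivalence between the minimum variance and eigenvector characterizations is easy to check numerically. The following is a minimal sketch on simulated data (the data, the seed and all variable names are illustrative assumptions, not part of the paper): the variables are standardized, the eigenvector for the smallest eigenvalue of the correlation matrix is taken as $a$, and the variance of $\sum_i a_i X_i$ is compared with the variances obtained from random unit vectors.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Illustrative data with a near-collinearity: x3 is almost x1 + x2.
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = x1 + x2 + 0.1 * rng.normal(size=n)
X = np.column_stack([x1, x2, x3])

# Standardize to mean 0 and variance 1, as in the text.
Z = (X - X.mean(axis=0)) / X.std(axis=0)
Sigma = Z.T @ Z / n                      # sample correlation matrix

# eigh returns eigenvalues in ascending order for symmetric matrices.
eigvals, eigvecs = np.linalg.eigh(Sigma)
a = eigvecs[:, 0]                        # unit-norm coefficient vector
smallest_pc = Z @ a                      # smallest principal component

# Variances of other unit-norm linear combinations, for comparison.
trial = rng.normal(size=(2000, 3))
trial /= np.linalg.norm(trial, axis=1, keepdims=True)
trial_vars = (Z @ trial.T).var(axis=0)

print("smallest eigenvalue      :", eigvals[0])
print("variance of smallest PC  :", smallest_pc.var())
print("min variance over trials :", trial_vars.min())
```

The variance of the smallest principal component equals the smallest eigenvalue, and no trial direction does better.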

A natural approach for a generalization of principal
components to additive functions is to extend
one of these definitions of the smallest linear
principal component of X. The minimum
variance characterization suggests defining $\Phi =
(\phi_1, \ldots, \phi_p)$ as the vector of transformations of
the variables minimizing $\operatorname{var}\sum_i \phi_i(X_i)$ subject
to some normalizing constraint. Alternatively, a
geometric characterization would determine the
additive manifold described by the constraint
$\sum_i \phi_i(X_i) = 0$, which minimizes the expected
squared distance from the observations to any
additive manifold of co-dimension 1. Unlike the
linear case, the additive functions determined by
the above two definitions will not be the same.

In this paper we use the minimum variance
definition, which has two useful characteristics
not shared by the geometric approach. First, the
minimum variance criterion leads to a characterization
of the additive principal component as an
eigenfunction, from which we gain a wealth of
theoretical insight into the behavior and properties
of our estimator. Second, finite sample
estimates are easy to compute, since the criterion
involves estimation of variance rather than
estimation of the euclidean distance between a
manifold and the data.

2.2  Definition  of  the  Smallest  Additive  Principal  Component

For simplicity, we will assume each additive function
of an APC to be centered, and require the
variance to be finite. Formally, the APC-function
of the $i$th variable, $\phi_i(X_i) \in H(X_i)$, where:

$$H(X_i) \subset \{\phi_i : E\,\phi_i(X_i) = 0,\ \operatorname{var}\phi_i < \infty\} = L_2(X_i).$$

The vector of APC-functions, $\Phi(X) =
(\phi_1(X_1), \ldots, \phi_p(X_p))$, belongs to the product
space defined by the component spaces $H(X_i)$:

$$\Phi(X) \in H(X), \qquad H(X) = H(X_1) \times H(X_2) \times \cdots \times H(X_p) \subset L_2(X).$$

Definition 2.1 The smallest additive principal
component of $X = (X_1, \ldots, X_p)$ in
$H(X)$ is the random variable $P(X) = \sum_i \phi_i(X_i)$,
$\phi_i \in H(X_i)$, minimizing
$\operatorname{var}\sum_i \phi_i(X_i)$ subject to $\sum_i \operatorname{var}\phi_i(X_i) = 1$.

Note that the constraint, $\sum_i \operatorname{var}\phi_i = 1$, is a
natural analogue to the linear constraint, $\sum_i a_i^2 = 1$,
under the usual assumption that all variables
have equal variance. For if $\phi_i(X_i) = a_i X_i$,

$$\sum_i \operatorname{var}\phi_i(X_i) = \sum_i \operatorname{var} a_i X_i = \sum_i a_i^2 \operatorname{var} X_i = \sum_i a_i^2 = 1.$$

Restriction of $H(X)$ to linear functions
reduces to a definition of linear principal
components for the correlation case.

3  THE  EIGEN  CHARACTERIZATION

3.1  Introduction

In the preceding section the definition of the
APC was presented as a natural extension of the
linear principal component to additive functions.
It is not unexpected, then, that the eigen characterization
of the additive principal component
can be derived by considering an extension of the
eigen characterization of the linear principal component.

Linear algebra gives the well known equivalence
between the statements:

minimize $\operatorname{var}\sum_i a_i X_i$ subject to $\sum_i a_i^2 = 1$

and

minimize $\langle a, \Sigma a \rangle$ subject to $\|a\|^2 = 1$

where $\langle \cdot, \cdot \rangle$ is the usual euclidean inner product
in $\mathbb{R}^p$. From the latter statement, it follows
that the vector of coefficients $a$, since it minimizes
the bounded, symmetric quadratic form
$Q(a) = \langle a, \Sigma a \rangle$, is an eigenvector of $\Sigma$. Thus, the
linear principal component problem can be solved
using standard linear algebraic techniques.

Analogously, the smallest additive principal
component can be characterized by either of the
following criteria:

$\Phi$ minimizes $\operatorname{var}\big(\sum_i \phi_i(X_i)\big)$
subject to $\sum_i \operatorname{var}\phi_i(X_i) = 1$

and

$\Phi$ minimizes $\langle \Phi, P\Phi \rangle_H$
subject to $\|\Phi\|_H^2 = 1$

The inner product, $\langle \cdot, \cdot \rangle_H$, in the above, is the natural
inner product on the product space of the
vector of APC-functions, $H(X)$:

$$\langle \Phi, \Psi \rangle_H = \sum_i \langle \phi_i, \psi_i \rangle.$$

The corresponding norm is:

$$\|\Phi\|_H^2 = \sum_i \operatorname{var}\phi_i.$$

P is the bounded, symmetric, linear operator of
the following lemma.

Lemma 3.1 The operator $P : H(X) \to H(X)$
defined by the relationship

$$\operatorname{var}\sum_i \phi_i = \langle \Phi, P\Phi \rangle_H$$

is the mapping:

$$[P\Phi]_i = E\Big[\sum_j \phi_j \,\Big|\, X_i\Big].$$

P is symmetric, non-negative definite and
bounded above by p.

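Proof

$$\langle \Phi, P\Phi \rangle_H = \sum_i \Big\langle \phi_i,\ E\Big[\sum_j \phi_j \,\Big|\, X_i\Big] \Big\rangle
= \sum_i E\Big[\phi_i\, E\Big[\sum_j \phi_j \,\Big|\, X_i\Big]\Big]
= \operatorname{var}\Big(\sum_i \phi_i\Big).$$

P is bounded by p:

$$\|P\Phi\|_H^2 = \sum_i \Big\| E\Big[\sum_j \phi_j \,\Big|\, X_i\Big] \Big\|^2
\le \sum_i \Big\| \sum_j \phi_j \Big\|^2
= p\,\Big\| \sum_j \phi_j \Big\|^2
\le p\,\Big(\sum_j \|\phi_j\|\Big)^2.$$

The maximum of $\sum_j \|\phi_j\|$ under the constraint
$\sum_j \|\phi_j\|^2 = 1$ is attained at $\|\phi_j\| = p^{-1/2}$. Hence,

$$\|P\Phi\|_H^2 \le p\,\Big(\sum_j \|\phi_j\|\Big)^2 \le p^2.$$

The inequality is sharp, with equality occurring
when $X_i = X_j$, $\phi_i = \phi_j$ $\forall\, i, j$.

Symmetry and non-negativity of P follow from
the properties of var(·). ■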

The eigen characterization of the APC now follows
almost trivially.

Theorem 3.2 The smallest eigenfunction of the
operator P, if it exists, is a vector of APC-functions
for the smallest additive principal component
of X.

Proof

Since $\langle \Phi, P\Phi \rangle_H = \operatorname{var}\sum_i \phi_i$ by Lemma 3.1,
and $\sum_i \operatorname{var}\phi_i = \|\Phi\|_H^2$ by definition, a function
vector $\Phi \in H(X)$ minimizes $\langle \Phi, P\Phi \rangle_H$
subject to $\|\Phi\|_H^2 = 1$ iff the set of transformations
$\{\phi_1, \phi_2, \ldots, \phi_p\}$ minimizes $\operatorname{var}\sum_i \phi_i(X_i)$ under
the constraint $\sum_i \operatorname{var}\phi_i(X_i) = 1$.

From the theory of symmetric operators, Jorgens
(1970), Th 6.7 p.125, it is well known that
$\Phi \in H(X)$ minimizing $\langle \Phi, P\Phi \rangle_H$ subject to
$\|\Phi\|_H^2 = 1$ is an eigenfunction for the smallest
eigenvalue of P (where it exists). ■

An immediate corollary to Theorem 3.2 is:

Corollary 3.3 Suppose $\Phi = (\phi_1, \ldots, \phi_p)$ is
a smallest eigenfunction of P belonging to the
smallest eigenvalue $\lambda$, with $\|\Phi\| = 1$. Then:

1. A smallest APC of X is $\sum_i \phi_i(X_i)$.

2. The variance of this smallest APC is $\lambda$.

Proof The first is immediate; for the second,

$$\operatorname{var}\sum_i \phi_i = \langle \Phi, P\Phi \rangle_H = \langle \Phi, \lambda\Phi \rangle_H = \lambda. \ ■$$

A further consequence of the eigenfunction
property of the smallest principal component is
the following characterization as a solution of the
APC stationary equations.

Corollary 3.4 A smallest additive principal
component with variance $\lambda$ satisfies the stationary
equation:

$$P\Phi = \lambda\Phi$$

Conversely, any $\Phi$ satisfying the stationarity conditions
for minimal $\lambda < 1$ is a smallest APC of
X.

The stationary equation implies a strong set of
identities for every APC-function: for each $i$,
the conditional expectation with respect to $X_i$ of
the APC is a multiple of the $i$th APC-function,
that is,

$$E\Big[\sum_j \phi_j \,\Big|\, X_i\Big] = \lambda\,\phi_i \quad \forall\, i.$$

Moreover, the multiple factor, $\lambda$, is constant for
all of the APC-functions.

Notice that if the smallest eigenvalue $\lambda_{\min} \approx 0$,
the conditional expectations of the smallest
APC with respect to all variables are almost zero.
In this sense we recall our initial motivation: to
find functions that come close to satisfying the
constraint $\sum_i \phi_i(X_i) = 0$.

3.2  Infinite  Dimensional  Function  Spaces 

We now address the issue of existence of the
smallest eigenspace. If $H(X)$ is finite dimensional,
the spectrum of the operator P is discrete,
and the smallest eigenspace exists and is distinct.
However, for infinite dimensional $H(X)$,
although the spectrum of P is bounded, the existence
of the smallest eigenspace is complicated
by the possibility of P having a non-trivial continuous
spectrum or spectral values that are not
eigenvalues. We can rule out these undesirable
possibilities by adopting suitable compactness
assumptions, following Breiman and Friedman
(1985).

Assumption: The restricted operators
$P_{i,k} : H(X_k) \to H(X_i)$, defined by
$P_{i,k}(h(X_k)) = E(h(X_k) \mid X_i)$, are compact
for $k \ne i$, $i = 1, \ldots, p$.

This assumption is only required for infinite dimensional
$H(X_i)$. A sufficient condition for compactness
to hold is given in Breiman and Friedman
(1985). It is straightforward to show that
the assumption of compactness implies the spectrum
of P is essentially discrete, since its continuous
spectrum consists of at most one point,
namely one. A smallest eigenvalue of one corresponds
to the null situation of mutual independence
of all variables.

4  RELATED  LITERATURE 

The idea of using a larger class of functions
in principal component analysis is not new: a
simple polynomial extension, for instance, appears
in the statistical literature in Gnanadesikan
(1977). By far the most comprehensive treatment
of extensions of principal components analysis,
however, is given by the optimal scaling techniques
developed by psychometricians. Multiple correspondence
analysis (Benzecri 1972, Lebart et al.
1984) and non-linear principal components analysis
(De Leeuw 1982a, Gifi 1981) are techniques,
used almost exclusively with categorical data, for
determining optimal scalings of the categories
(which is equivalent to estimating step functions
of the variables) with low dimensional structure.

In this paper we focus on the smallest APC,
corresponding to the smallest eigenvalue. In psychometrics,
the intended application is an extension
of the use of the largest linear principal
components for dimension reduction. The largest
APCs are clearly interesting in their own right;
however, their interpretation and potential applications
are very different from those of the smallest
APC. Nonetheless, we acknowledge that these
methods from psychometry use essentially the
same notions as additive principal components.

The optimal scalings of multiple correspondence
analysis are equivalent to the largest APCs
defined over the finite dimensional function space
spanned by normalised indicator functions of
the variable categories. Multiple correspondence
analysis examines the largest eigenfunctions of
the corresponding finite dimensional operator.

The non-linear principal components or PRINCALS
(Principal Components by Alternating
Least Squares) analysis allows only one set of
transformations of the variables, rather than the
multiple transformations of multiple correspondence
analysis. The transformations are defined
to be optimal for some fixed dimensional representation,
$d$; hence for $d = 1$ they are equivalent
to multiple correspondence analysis, but
for $d > 1$ PRINCALS gives a different solution.
De Leeuw (1982b) has extended PRINCALS
to continuous variables. The functions are
estimated using a finite dimensional B-spline basis,
hence the problem can be recast as a finite dimensional
eigenproblem, solvable by linear techniques.

5  ESTIMATION  OF  THE  ADDITIVE 
PRINCIPAL  COMPONENT 

5.1  Introduction 

The smallest additive principal component corresponds
to the smallest eigenvalue of the symmetric
non-negative definite operator P. Thus,
for estimation of the additive principal component
we turn to known methods for calculating
eigenfunctions.

If all the function spaces, $H(X_i)$, of the APC-transforms
are finite dimensional, estimation can
be simplified to finding the smallest eigenvector
of a finite dimensional matrix. We also present an
iterative algorithm based on the power method of
estimating eigenfunctions, an approach which is
valid in the population for general $H(X)$.

We shall not discuss the stability of these estimation
methods in depth, but it is important
to bear in mind that estimating the smallest
eigenfunction is an intrinsically unstable problem
when the second smallest eigenvalue is close
to the smallest eigenvalue. Any estimation procedure
will have difficulty finding a unique, stable
estimate of the eigenfunction in this case.

5.2  The  direct  solution  for  finite  dimensional  APC

Assume the function space for the $i$th variable,
$H(X_i)$, is finite dimensional. Then for some finite
set of orthogonal basis functions,

$$H(X_i) = \operatorname{span}\{ f_{ik}(X_i) : E\, f_{ik}(X_i) = 0,\ E\big(f_{ik}(X_i) f_{ik'}(X_i)\big) = 0 \text{ for } k \ne k',\ k = 1, \ldots, d_i \}.$$

Since $\phi_i \in H(X_i) \iff \phi_i = \sum_{k=1}^{d_i} a_{ik} f_{ik}(X_i)$, the
APC criterion can be written:

$$\operatorname{var}\Big(\sum_i \phi_i(X_i)\Big) = \operatorname{var}\Big(\sum_i \sum_k a_{ik} f_{ik}(X_i)\Big)
= \operatorname{var}\Big(\sum_i F_i(X_i)\, a_i\Big)
= a^t \operatorname{var}\big(F(X)\big)\, a$$

where

$F_i(X_i) = (f_{i1}(X_i), \ldots, f_{id_i}(X_i))$,
$a_i = (a_{i1}, \ldots, a_{id_i})$,
$F(X) = (F_1(X_1), \ldots, F_p(X_p))$,
$a^t = (a_1^t, \ldots, a_p^t)$.

The normalising constraint, taking the basis functions to have unit variance, is simply:

$$\sum_i \operatorname{var}\phi_i = \sum_i \sum_{k=1}^{d_i} a_{ik}^2 \operatorname{var} f_{ik} = a^t a = 1.$$

Estimating the smallest APC simplifies to calculating
the smallest linear principal component
of the basis vectors, $F(X)$: for the eigenvector $a$
belonging to the smallest eigenvalue, the APC is $F(X)a$, and
the APC-functions are

$$\phi_i(X_i) = F_i(X_i)\, a_i.$$

Finite sample estimation is straightforward:
express each basis function as a functional
of the distribution function of $X_i$, written here as $F_{X_i}$:
$f_{ik}(X_i) = C_{ik}(F_{X_i})$. Replacing $F_{X_i}$ with the empirical
distribution function, $\hat{F}_{X_i}$, yields finite sample
estimates: $\hat{f}_{ik}(x_i) = C_{ik}(\hat{F}_{X_i})$. An APC estimate
can be obtained from the eigen decomposition
of the correlation matrix of
$\hat{F}(x) = (\hat{f}_{11}(x_1), \hat{f}_{12}(x_1), \ldots, \hat{f}_{p d_p}(x_p))$.
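
The direct solution can be sketched in a few lines. The example below is an illustration under stated assumptions, not the authors' implementation: it assumes a quadratic polynomial basis for every variable, orthonormalised in the sample by a QR decomposition, and simulated data in which the first two variables lie near a circle and the third is pure noise. The function names `orthonormal_poly_basis` and `direct_apc` are hypothetical.

```python
import numpy as np

def orthonormal_poly_basis(x, degree):
    """Centered polynomial basis in x with orthogonal, unit-variance columns."""
    n = len(x)
    powers = np.column_stack([x**k for k in range(1, degree + 1)])
    powers = powers - powers.mean(axis=0)      # sample version of E f_ik = 0
    q, _ = np.linalg.qr(powers)                # orthonormal columns, still centered
    return q * np.sqrt(n)                      # unit sample variance (divisor n)

def direct_apc(X, degree=2):
    """Smallest APC from the eigen decomposition of the basis covariance matrix."""
    n, p = X.shape
    blocks = [orthonormal_poly_basis(X[:, i], degree) for i in range(p)]
    F = np.hstack(blocks)                      # n x (p * degree) basis matrix
    cov = F.T @ F / n                          # unit diagonal: a correlation matrix
    eigvals, eigvecs = np.linalg.eigh(cov)     # ascending eigenvalues
    a = eigvecs[:, 0].reshape(p, degree)       # row i holds the coefficients a_i
    phis = np.column_stack([blocks[i] @ a[i] for i in range(p)])
    return eigvals[0], phis

# Assumed illustrative data: (x1, x2) near the unit circle, x3 unrelated.
rng = np.random.default_rng(0)
n = 500
theta = rng.uniform(0.0, 2.0 * np.pi, size=n)
X = np.column_stack([
    np.cos(theta) + 0.02 * rng.normal(size=n),
    np.sin(theta) + 0.02 * rng.normal(size=n),
    rng.normal(size=n),
])

eigenvalue, phis = direct_apc(X, degree=2)
weights = phis.std(axis=0)                     # APC-function weights sd(phi_i)
print("smallest eigenvalue:", eigenvalue)
print("APC-function weights:", weights)
```

Because each block of basis functions is orthonormal in the sample, the constraint $a^t a = 1$ coincides with $\sum_i \operatorname{var}\phi_i = 1$, and the variance of the estimated APC equals the smallest eigenvalue.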

5.3  The  iterative  method 

Iterative calculation of the smallest eigenfunction
uses a componentwise minimization scheme,
where each function is estimated in turn, using
the function estimates of the previous iteration.
The iterative approach is important both because
it enables estimation for a class of functions that
are only constrained to be "smooth", and because
it provides an alternative to the expense
of an eigen decomposition when the dimension of
$H(X)$ is large.

5.3.1  A  power  algorithm 

It is easily shown for a symmetric, non-negative
operator, that for some initial $\Phi^{(0)}$ the sequence:

$$\Phi^{(k)} = \frac{P^k \Phi^{(0)}}{\|P^k \Phi^{(0)}\|}, \qquad k = 1, 2, \ldots$$

converges a.s. to the eigenfunction of P belonging
to the maximal eigenvalue. This can easily
be adapted to find the eigenfunction belonging
to the smallest eigenvalue, since there is a simple
linear relationship between the eigenvalues and
eigenfunctions of $P$ and $pI - P$.

The eigenvalues of P are non-negative and
bounded above by p. For $\Phi$ an eigenfunction of
P with eigenvalue $\lambda$,

$$(pI - P)\Phi = p\Phi - P\Phi = (p - \lambda)\Phi.$$

It follows that $P$ and $pI - P$ have common eigenfunctions;
however, the order of eigenvalues for
the shifted operator, $pI - P$, is reversed. Thus,
the sequence:

$$\Phi^{(k)} = \frac{(pI - P)^k \Phi^{(0)}}{\|(pI - P)^k \Phi^{(0)}\|}$$

converges to the smallest eigenfunction of P. The
value p can be replaced by the largest eigenvalue
of the operator P, in any specific problem, which
will improve the rate of convergence dramatically.
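
The shift argument can be illustrated on an ordinary symmetric matrix. The sketch below is an assumption-laden toy, not the APC operator itself: it builds a 5 x 5 symmetric non-negative definite matrix with known eigenvalues, runs the power iteration on `shift * I - P`, and compares a loose upper bound on the spectrum with the largest eigenvalue as the shift. The function name `shifted_power` and the chosen eigenvalues are illustrative.

```python
import numpy as np

def shifted_power(P, shift, tol=1e-12, max_iter=10_000, seed=0):
    """Power iteration on (shift * I - P); returns (Rayleigh quotient, vector, iterations)."""
    rng = np.random.default_rng(seed)
    v = rng.normal(size=P.shape[0])
    v /= np.linalg.norm(v)
    M = shift * np.eye(P.shape[0]) - P
    last = np.inf
    for k in range(1, max_iter + 1):
        v = M @ v
        v /= np.linalg.norm(v)                 # standardize to unit norm
        lam = v @ P @ v                        # current estimate of the eigenvalue
        if abs(last - lam) < tol:
            break
        last = lam
    return lam, v, k

# Symmetric non-negative definite test matrix with known spectrum (assumed values).
rng = np.random.default_rng(42)
Q, _ = np.linalg.qr(rng.normal(size=(5, 5)))
true_eigs = np.array([0.1, 0.8, 1.0, 1.3, 1.8])
P = Q @ np.diag(true_eigs) @ Q.T

# Shift by a loose upper bound on the spectrum (here the dimension, 5) ...
lam_bound, v_bound, iters_bound = shifted_power(P, shift=5.0)
# ... and by the largest eigenvalue itself.
lam_tight, v_tight, iters_tight = shifted_power(P, shift=true_eigs.max())

print("loose shift :", lam_bound, "after", iters_bound, "iterations")
print("tight shift :", lam_tight, "after", iters_tight, "iterations")
```

Both runs recover the smallest eigenvalue; the tighter shift needs noticeably fewer iterations because the ratio between the two largest eigenvalues of the shifted matrix is further from one.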

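The iteration scheme employed is an alternating
conditional expectation algorithm, in the
same vein as the algorithm used to estimate
ACE (Alternating Conditional Expectation) regression.
Algorithmically, the sequence is generated
as follows:

Algorithm

Choose initial transformations $\phi_1^{(0)}, \ldots, \phi_p^{(0)}$.

Repeat 1 and 2 for $k = 1, 2, \ldots$ [Outer Loop]

(1) Do for $i = 1, \ldots, p$ [Inner Loop]

$$\phi_i^{(k)} = p\,\phi_i^{(k-1)} - E\Big[\sum_j \phi_j^{(k-1)} \,\Big|\, X_i\Big]$$

(2) Standardize

$$\Phi^{(k)} = (c\phi_1^{(k)}, c\phi_2^{(k)}, \ldots, c\phi_p^{(k)})
\quad \text{where} \quad c = \Big(\sum_i \operatorname{var}\phi_i^{(k)}\Big)^{-1/2}$$

Until $\operatorname{var}\sum_i \phi_i^{(k)}$ converges.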

Note that while in the ACE regression algorithm,
each $\phi_i$ is updated to its new transformation as
the inner loop proceeds, we obtain the new p-tuple
using only the previous p-tuple throughout
the entire inner loop.
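
The outer and inner loops can be sketched for the finite dimensional case, where each conditional expectation is replaced by a least squares projection onto a basis in $X_i$. This is a minimal illustration under assumptions not taken from the paper: a centered quadratic polynomial basis, a random starting point, simulated circle data, and the hypothetical names `centered_poly_basis`, `project` and `iterative_apc`.

```python
import numpy as np

def centered_poly_basis(x, degree=2):
    """Centered powers of x; their span plays the role of H(X_i) (assumed basis)."""
    B = np.column_stack([x**k for k in range(1, degree + 1)])
    return B - B.mean(axis=0)

def project(y, B):
    """Least squares projection of y on the columns of B: estimate of E[y | X_i]."""
    coef, *_ = np.linalg.lstsq(B, y, rcond=None)
    return B @ coef

def iterative_apc(X, degree=2, tol=1e-12, max_iter=5000, seed=1):
    n, p = X.shape
    bases = [centered_poly_basis(X[:, i], degree) for i in range(p)]
    rng = np.random.default_rng(seed)
    # Initial transformations: a random element of each function space.
    phi = np.column_stack([B @ rng.normal(size=degree) for B in bases])
    phi /= np.sqrt(phi.var(axis=0).sum())
    last = np.inf
    for _ in range(max_iter):                          # outer loop
        total = phi.sum(axis=1)                        # uses the previous p-tuple only
        new = np.column_stack([                        # inner loop over variables
            p * phi[:, i] - project(total, bases[i]) for i in range(p)
        ])
        phi = new / np.sqrt(new.var(axis=0).sum())     # standardize: sum of variances = 1
        apc_var = phi.sum(axis=1).var()
        if abs(last - apc_var) < tol:                  # until var(sum phi_i) converges
            break
        last = apc_var
    return apc_var, phi, bases

# Assumed illustrative data: (x1, x2) near the unit circle, x3 unrelated.
rng = np.random.default_rng(0)
n = 500
theta = rng.uniform(0.0, 2.0 * np.pi, size=n)
X = np.column_stack([
    np.cos(theta) + 0.02 * rng.normal(size=n),
    np.sin(theta) + 0.02 * rng.normal(size=n),
    rng.normal(size=n),
])

apc_var, phi, bases = iterative_apc(X)
total = phi.sum(axis=1)
# Residual of the stationary equation E[sum_j phi_j | X_i] = lambda * phi_i.
stationarity_gap = max(
    np.abs(project(total, bases[i]) - apc_var * phi[:, i]).max() for i in range(3)
)
print("variance of the APC:", apc_var)
print("largest stationarity residual:", stationarity_gap)
```

At convergence the projected sum is, up to numerical error, a common multiple of each estimated function, which is the finite sample analogue of the stationary equation.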

The iterative algorithm, as a version of the
power algorithm, shares its shortcomings as a
method of estimation: it is prone to difficulties
associated with finding local, rather than global,
stationary points, hence it can be sensitive to
starting values; convergence will be slow when
neighbouring eigenvalues are close.

Obtaining finite sample estimates using the
iterative method essentially entails choosing a
method for estimating the conditional expectation
term of the inner loop. If $H(X)$ is finite
dimensional, this is easily done. If $H(X)$ has infinite
dimension, approximate solutions are computed
using smoothing techniques.

5.3.2  Finite  dimensional 

For the finite dimensional $H(X_i)$, each conditional
expectation operator has the decomposition:

$$E(\phi \mid X_i) = \sum_k \langle \phi, f_{ik} \rangle f_{ik},$$

so each inner loop step is simply a linear least
squares regression of $\phi = \sum_j \phi_j$ on $f_{i1}, \ldots, f_{id_i}$.
Noting that $f_{ik} \perp f_{ik'}$, it is easily shown that the inner
products $\langle \phi, f_{ik} \rangle$ are the coefficients of the linear
least squares regression of $\phi$ on $f_{i1}, \ldots, f_{id_i}$.

Finite  sample  estimates  are  obtained  in  the  ob¬ 
vious  way,  the  inner  loop  step  is  simply  : 

;  =  i 

and  =  ar'"’A 

5.3.3  Infinite  dimensional 

A powerful and practical alternative to finite
dimensional estimation techniques is estimation
of conditional expectations using scatterplot
smoothers.

Let $S_i$ denote a smoother with respect to $X_i$.
The inner loop step is implemented as:

$$\phi_i^{(k)} = (r - 1)\,\phi_i^{(k-1)} - S_i\Big(\sum_{j \ne i} \phi_j^{(k-1)}\Big)$$

Since $S_i$ is typically not a projection operator,
it is important that $\phi_i$ is excluded from the
smoothed term. The value $r$ is an estimate of the
largest eigenvalue of P.

The advantages of using a smoother for estimation
in terms of flexibility, interpretability and
cost are obvious. The disadvantage is that most
smoothers are non-linear, hence mathematical
analysis of the estimation procedure is usually
not feasible. Our experience, however, matches
that of Breiman and Friedman (1985) with the
ACE algorithm: with good starting guesses the
iterative procedure generally converges to acceptable
estimates of the minimizing functions.

We have presented both a direct and an iterative
method for computing APC estimates, which are
equivalent when the function spaces are assumed
known and finite. In practice, the iterative algorithm
implemented with a scatterplot smoother
may be a preferable method, particularly for exploratory
analysis, since smoothing techniques
place far fewer restrictions on the function space.
Unfortunately, justification for this procedure is
heuristic for the most part, as smoothers are not
usually projection operators. Nevertheless, in the
ensuing examples, we use the iterative algorithm
with a scatterplot smoother for estimation of the
conditional expectation.

6  INTERPRETATION  OF  ESTIMATES 

Using the smallest APC as an applied technique
for multivariate analysis of a dataset requires
careful consideration of the properties and
interpretation of the estimators characterizing
the APC: the eigenvalue; the APC; the APC-functions.
However, unlike a linear analysis, examining
these estimates alone is not sufficient
to infer the dependencies between the variables.
The APC determines a dependency linear in the
transformed variables, so unless the transforms
themselves are linear, translating this dependence
to the original space is far from easy. We
suggest a graphical technique using simultaneous
highlighting of plots to aid in understanding the
concurvity.

6.1  The  estimates 

1. Eigenvalue : $\operatorname{var}\sum_i \phi_i$

The eigenvalue measures the strength of concurvity,
and by definition is bounded between
0 and 1: 0 corresponds to exact concurvity,
1 to mutual independence of transformed
variables. The size and spacing of
different eigenvalues can warn about potential
difficulties with stability and uniqueness:
writing $\lambda_1 \le \lambda_2$ for the two smallest eigenvalues,
since $\lambda_2 \le 1$, instability becomes more likely
as $\lambda_1$ approaches 1.

2. APC : $\sum_i \phi_i(X_i)$

The smallest APC, by definition, has minimal
variance, hence interpretation of the
APC vector is akin to a residual analysis.
Ideally, it will be distributed symmetrically
about zero; departures from symmetry, such
as outliers or grouping in the APC, indicate
cases which are unusual with respect to the
concurvity relation.

3. APC-function weights : sd($\phi_i(X_i)$)

The relative scale of the APC-transforms, as
measured by the APC-function weights, indicates
the relative importance of the variable
in the APC: a zero weight indicates no
contribution, a large weight a large influence.

4. APC-functions : $\phi_i(X_i)$

Plotting $\phi_i(X_i)$ versus $X_i$ reveals the shape
of the transform, which can indicate the sensitivity
of the values of $X_i$ in the dependence:
a step function indicates sensitivity
only between corresponding levels of the
variable, an asymptote defines a region of
relative insensitivity.


6.2  Interpretation  using  graphics 

Suppose for a data set, x, we have estimated the
smallest APC, and its eigenvalue is small, implying
near concurvity between the variables. The
transformed data, $\Phi(x)$, have the strongest linear
dependence achievable. How can we interpret
what this linear dependence of transformed
variables implies for the relationship between the
original variables? Simultaneous highlighting of
scatterplots of the data facilitates the interpretation
of the concurvity.

Simultaneous highlighting requires a graphics
capability that is most naturally suited to a
high resolution graphics terminal equipped with
a flexible pointer device, such as a mouse; however,
it can also be effective with static plots. For
a set of plots displaying different variables of the
same data set, we want to select any group of
cases in any plot and have the selected cases highlighted
in all the remaining plots; thus a subset of
cases is highlighted simultaneously in all plots.
Selection and highlighting are usually indicated
by a change in color, size or symbol of the selected
cases.

The use of simultaneous highlighting for interpreting
an APC is best illustrated by a simple
example in three variables.

Figure 1: The APC-function plots of the smallest APC. $\operatorname{var}(\sum_i \phi_i) = 0.084$. Highlighting of large
values of $\phi_1(X_1)$ indicates a strong relationship between $X_1$ and $X_2$.


6.3  An  example:  Interpretation  for  a  three 
variable  APC 

The smallest APC of $x_1, x_2, x_3$, estimated using
the iterative algorithm with the supersmoother,
has variance 0.084; hence the data almost lie on
a surface in 3-space. The variable weights are:
sd $\phi_1$ = 0.71, sd $\phi_2$ = 0.71, sd $\phi_3$ = 0.03,
so we can conclude the third variable is not
important in determining the relationship between
the variables. The APC-functions are plotted in
Figure 1.

For the moment, consider $\phi(x)$ and x to be
distinct data sets: the former has a strong linear
structure which we want to use to explore the
structure of its untransformed version. Display
the two data sets together in the 3 scatterplots:
$\phi_1$ vs $x_1$, $\phi_2$ vs $x_2$, $\phi_3$ vs $x_3$; so $\phi(x)$ appears in
the horizontal marginal projection, x in the
vertical marginal projection. The small variance of
the APC implies $\phi_1$, $\phi_2$ and $\phi_3$ are almost linearly
dependent: $\phi_1 + \phi_2 + \phi_3 \approx 0$. As sd($\phi_3$) is small,
the values of $\phi_3$ are always close to zero, hence
low values of $\phi_1$ will constrain values of $\phi_2$ to
be high. Selecting low values of $\phi_1$, as in Figure
1, and simultaneously highlighting the selected
cases in the other plots illustrates this constraint.

Now, highlighting enables us to use the linear
dependence of the transformed variables to
reveal the dependence between the original
variables. In Figure 1, low values of $\phi_1$ occur when
$x_1$ is extreme (either high or low); high values
of $\phi_2$ occur when $x_2$ is central; in the plot of $x_3$, the
selected points are evenly spread along the
horizontal axis, confirming the observation that this
variable does not determine the concurvity. The
constraint $\phi_1 + \phi_2 \approx 0$, for low $\phi_1$, can be
interpreted in the variables x: cases with values of
$x_2$ near zero will have either high or low values of
$x_1$. Continuing with selection of cases by
conditioning on values over the entire range of $\phi_1$, we
can understand the configuration of the variables
in the original scaling. In this case, the
relationship between $x_1$ and $x_2$ is easily understood to be
circular, hence the variables lie on a cylinder
oriented lengthwise along the $x_3$ axis (Figure 2). In
general, by conditioning on the values of each
transform in turn, much more complex relationships
can be explored using this technique.
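
The selection logic of this example can be imitated numerically. The following is only a sketch under stated assumptions: the data are simulated on a noisy cylinder, and the transforms $\phi_1 = c(0.5 - x_1^2)$, $\phi_2 = c(0.5 - x_2^2)$, $\phi_3 = 0$ are hypothetical stand-ins chosen by hand, not the supersmoother estimates used in the paper. The function name `cylinder_demo` and all parameter values are illustrative.

```python
import numpy as np

def cylinder_demo(n=500, noise=0.02, seed=0):
    """Simulate points near a cylinder and select cases with low phi_1."""
    rng = np.random.default_rng(seed)
    theta = rng.uniform(0.0, 2.0 * np.pi, n)
    x1 = np.cos(theta) + noise * rng.standard_normal(n)
    x2 = np.sin(theta) + noise * rng.standard_normal(n)
    x3 = rng.uniform(-1.0, 1.0, n)  # axis of the cylinder
    # Hypothetical transforms (an assumption): they sum to roughly zero
    # because x1**2 + x2**2 is close to 1 on the cylinder.
    phi1 = 0.5 - x1**2
    phi2 = 0.5 - x2**2
    phi3 = np.zeros(n)
    # Scale so that the variances of the transforms sum to one.
    scale = np.sqrt(np.var(phi1) + np.var(phi2) + np.var(phi3))
    phi = np.column_stack([phi1, phi2, phi3]) / scale
    # "Highlight" the 10% of cases with the lowest phi_1.
    selected = phi[:, 0] <= np.quantile(phi[:, 0], 0.1)
    return np.column_stack([x1, x2, x3]), phi, selected

x, phi, selected = cylinder_demo()
print("variance of phi_1 + phi_2 + phi_3:", np.var(phi.sum(axis=1)))
print("largest |x2| among selected cases:", np.abs(x[selected, 1]).max())
```

Under these assumptions the selected cases have extreme $x_1$, values of $x_2$ near zero, and high $\phi_2$, which is the pattern read off the highlighted scatterplots.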

7  Boston  Housing  Data 

The variables for this example are the variables
selected by Breiman and Friedman (1985) as
explanatory variables for median housing values in
Boston:

Noxsq  Nitrogen oxide concentration in pphm,
squared

Tax  Full property tax rate

Ptratio  Pupil-teacher ratio of the town school
district

Lstat  Proportion of population that is of lower
status

Roomsq  Average number of rooms, squared

The smallest APC of the five variables is
estimated, and shown in Figure 3. The variance
of the APC is 0.035, hence there is strong
evidence that dependencies exist. The estimated
transforms indicate Tax as the major dependent
variable in the smallest APC. This transform
separates the two highest Tax values from the rest
of the data; highlighting of these values shows
these cases have an almost exact correspondence
in Ptratio, and have relatively high values of the
pollution indicator, Noxsq. The smallest APC
has identified a group of town districts (the
districts explain the close correspondence between
Tax and Ptratio) with high property taxes and
high pollution.

Figure 2: The configuration of the dataset in Example 6.1. The variables lie on the surface of a
cylinder

The interpretation through highlighting
depends quintessentially on the additivity of the
APC relationship: from this follows the linearity
between transformed variables that leads to
the highlighting method. The essence of the idea
is that the APC-transforms guide selection of the
cases for each variable, so that the nature of the
dependence is clear.

8  CONCLUSION 

As a natural extension of linear principal
component methodology, additive principal
components have many potential fields of application.
In psychometrics, where multiple correspondence
analysis and non-linear principal components
analysis have gained wide acceptance, the utility
of APC techniques does not have to be argued.
The use of the smallest APC as a method for
detecting high dimensional structure in data,
structure that cannot be easily detected even with
sophisticated graphical tools, is a novel approach
to multivariate data analysis.

The elegant Hilbert space theory underpinning
the APC presents a strong case for this
generalization of linear principal components. The
characterization as an eigenfunction provides a large,
well understood body of literature with which to
approach theoretical considerations, and the task
of estimation.

As an applied method, concern centers on two
issues: the reliability of the estimates and the
accessibility of the information it provides.
Methods of assessing reliability, based on asymptotic
results for eigenvalue estimation, are well known
for the direct estimation methods; there is a lack
of such results for general smoothing techniques.

The interpretation techniques we have
presented are a first attempt at providing a readily
accessible method for understanding the
non-linear dependencies of the smallest APC. This
task of interpretation is not an easy one; clearly
there are many approaches yet to be explored.

STOCHASTIC TESTS OF FIT


P.W.  Millar,  University  of  California 


0.  Introduction. 

This paper describes a method for using controlled
randomization, coupled with computationally intensive
methods, to resolve computational problems arising from
a broad class of goodness of fit tests. Since the models
whose fitness is being assessed here are, in general,
non-parametric, a certain amount of care is necessary in
choosing methods of numerical implementation. Issues
surrounding the choice of method are discussed in
sections 1, 2. The main new result (cf. sec. 3) is a very
general asymptotic representation theorem which, among
other things, can be used to justify asymptotically both
the methods proposed and the validity of bootstrap
methods for calculating critical values from the
approximating expressions.

1.  Computational  difficulties  in  certain  tests  of  fit. 

Let $x_1, x_2, \ldots, x_n$ be iid, $R^d$-valued random variables
with unknown common distribution G. Let $\Theta$ be an
index set, and $\{P_\theta, \theta \in \Theta\}$ a statistical model that can be
either parametric, semiparametric, or non-parametric. An
important question is to decide whether the model $\{P_\theta\}$
fits the data; more precisely, it is desired to test the null
hypothesis that G belongs to $\{P_\theta, \theta \in \Theta\}$.

A reasonable class of tests can be described as
follows. Let B be a Banach space, $T_\theta$ a B-valued function
of $\theta$, and $\hat T_n$ a B-valued function of the data $x_1, \ldots, x_n$.
If $|\cdot|$ denotes the norm of B, then a plausible goodness of
fit statistic is

(1.1)  $\inf_{\theta \in \Theta} n^{1/2} |\hat T_n - T_\theta|$,

the hypothesis being rejected for large values. While this
recipe is reasonable in a large number of situations, the
statistic (1.1) is incomputable, in general, for several
reasons. To understand the computational difficulties
surrounding (1.1), and to understand the computationally
intensive substitutes for (1.1) which we develop later on,
let us first look at several examples of statistics of the
form (1.1).

Example 1.1: classical statistics. Let the data be real
valued, and let $\{P_\theta\}$ be a parametric family of probabilities on
the line. Let $T_\theta$ be the cdf of $P_\theta$: $T_\theta(t) = P_\theta\{x_1 \le t\}$, and
let $\hat T_n$ be the empirical cdf of the data:
$\hat T_n(t) = n^{-1} \sum_i I\{x_i \le t\}$. The Banach space B above can
be taken to be the collection of real bounded functions f,
with norm $|f| = \sup_t |f(t)|$ (i.e., $B = L_\infty(R^1)$). Then (1.1)
becomes the usual Kolmogorov-Smirnov goodness of fit
statistic, whose properties are well known if $\Theta$ consists of
a single point.

To obtain a different classical statistic in this
framework, let $\mu$ be a probability on $R^1$, and let B be the space
of real functions f with norm $|f|^2 = \int f(s)^2 \mu(ds)$ (so now
$B = L_2(\mu)$). Then, with $T_\theta$, $\hat T_n$ as before, and the norm
just given, (1.1) is a variant of the Cramer-von Mises
goodness of fit statistic.


For the remaining examples, we shall, for convenience,
deal mainly with one particular space B. To describe it,
let $S^d = \{s \in R^d: |s| = 1\}$ be the unit spherical shell of
$R^d$. Define the halfspace A(s, t) by

(1.2)  $A(s,t) = \{x \in R^d: x's \le t\}$.

If P is any probability on $R^d$ define P(s, t) by

(1.3)  $P(s,t) = P\{A(s,t)\}$.

By this means, the probability P is identified with an
element of $L_\infty(S^d \times R^1)$, the Banach space of bounded
continuous functions on $S^d \times R^1$ with supremum norm. For
d-dimensional data, half-spaces are the simplest class of
sets which remains invariant under affine transformation
and which separates measures: if P(A) = Q(A) for all
halfspaces A, then P and Q are identical as measures on
the Borel sets. In certain problems, such as the location
model below, halfspaces fit in with the structure of the
model in an elegant way, leading to a much simpler
analysis than one based, e.g., on lower left quadrants (cf.
(2.2c) below).

Example 1.2: parametric models. Let $\Theta$ be a subset of
$R^k$, so that $\{P_\theta, \theta \in \Theta\}$ is a parametric model. A
goodness of fit test of the form (1.1) is then

(1.4)  $\inf_{\theta \in \Theta} n^{1/2} \sup_{s,t} |\hat T_n(s,t) - T_\theta(s,t)|$

where $T_\theta(s,t) = P_\theta\{A(s,t)\}$, $\hat T_n(s,t) = \hat P_n\{A(s,t)\}$,
and where $\hat P_n$ is the empirical measure of $x_1, \ldots, x_n$.

One especially important case is when $\{P_\theta\}$ is the
collection of normal distributions on $R^d$ with unknown
mean and covariance. Another important example,
discussed briefly later on, concerns the Fisher distributions
on the unit spherical shell in 3 dimensions. In this latter
case, the supremum in (1.4) becomes
$\sup_C n^{1/2} |\hat P_n(C) - P_\theta(C)|$ where C ranges over all
spherical caps on $S^3$.

Example 1.3: symmetric location models on $R^d$. In $R^d$
there are many notions of symmetry for random variables;
for example,

(a) "simple symmetry": the rv X has the property that
both X and -X have the same distribution

(b) "isotropy": the rv X has the property that X and $\gamma X$ have
the same distribution for every orthogonal transformation
$\gamma$.

See Beran and Millar, 1988c, for further
developments. Let $\mathcal{F}_0$ denote the collection of all probabilities on
$R^d$ that are "symmetric" according to, say, one of the
possibilities just suggested. The $\mathcal{F}_0$ symmetry model
asserts that for some unknown $\eta \in R^d$ and some unknown
$F \in \mathcal{F}_0$, the centered data $x_1 - \eta, \ldots, x_n - \eta$ have
distribution F. The parameter set $\Theta$ then consists of all pairs
$\theta = (\eta, F)$, $\eta \in R^d$, $F \in \mathcal{F}_0$; $P_\theta$ is the probability given by
$P_\theta(A) = F(A - \eta)$, and (1.1) becomes

(1.5)  $\inf_{(\eta, F)} n^{1/2} \sup_{s,t} |\hat T_n(s,t) - T_\theta(s,t)|$

where, as usual, $T_\theta(s,t) = P_\theta(A(s,t))$, $\hat T_n(s,t) = \hat P_n(A(s,t))$.

Example 1.4: Logistic model. Here the data
$x_1, \ldots, x_n$ take the form $x_i = (y_i, z_i)$, where $y_i$ takes on
only the values 0, 1 and where the $z_i$ are iid $R^d$-valued with
distribution F. If $\beta = (\beta_0, \beta_1, \ldots, \beta_d) \in R^{d+1}$ and if F is
a probability on $R^d$, then the logistic model asserts that
$P(y_i = 1 \mid z_i) = p(\beta; z_i)$, where
$\log\{p(\beta;z)[1 - p(\beta;z)]^{-1}\} = \beta_0 + \boldsymbol{\beta}_1'z$, with $\boldsymbol{\beta}_1 = (\beta_1, \ldots, \beta_d)$.
Then $\theta = (\beta, F)$ where $\beta \in R^{d+1}$ and F is the probability
governing the covariates $z_1, \ldots, z_n$. The family $P_\theta$ is
given by $P_\theta\{y_1 = 1, z_1 \in A\} = \int_A p(\beta; z) F(dz)$. Define

$T_\theta^i(s,t) = \int_{A(s,t)} [1 - p(\beta;z)]^{1-i} p(\beta;z)^i F(dz)$,  i = 0, 1

and $\hat T_n^i(s,t) = n^{-1} \sum_j I\{y_j = i, z_j \in A(s,t)\}$, i = 0, 1. Set
$T_\theta = (T_\theta^0, T_\theta^1)$, $\hat T_n = (\hat T_n^0, \hat T_n^1)$, random elements in the
Banach space $B = L_\infty(S^d \times R^1) \times L_\infty(S^d \times R^1)$. With the
Banach space just mentioned (the norm is the maximum
of the norms of the factor spaces), the test statistic (1.1)
for the logistic model is, with the above choices of $\hat T_n$,
$T_\theta$, B, given by (1.5).

With these examples behind us, we can now easily
see the computational difficulties surrounding (1.1). First,
the computation of $\inf_\theta$ will, for the supremum-type norms
described above, be intractable. In the location and
logistic models, this calculation involves an infimum over an
infinite dimensional collection of probabilities. Actually,
this particular calculation is already intractable (for
supremum norms) in the Gaussian and Fisher cases
described in example 1.2. Second, the calculation for
fixed $\theta$ of $|\hat T_n - T_\theta|$ is also intractable; in
examples 1.2-1.4, it involves computing the supremum over the
collection of half-spaces in $R^d$. Finally, even the
computation of $T_\theta(s,t) \equiv P_\theta(A(s,t))$ in examples 1.2-1.4 is
typically intractable. In Gaussian parametric situations, the
calculation is simple, because of properties of the normal
distribution; on the other hand, for the Fisher distributions
on $S^3$, the calculation of $T_\theta(s,t)$ (the mass given a
particular "spherical cap" by a particular Fisher distribution) is
intractable and must be obtained by approximation.
Equally severe computational difficulties can arise in the
logistic and location models.

This paper describes very general, computationally
intensive methods which can successfully confront the
numerical intractability of (1.1). These methods involve
(a) replacing the parameter set $\Theta$ by a (random) subset
$\Theta_n$, (b) replacing the norm $|\cdot|$ of B by a random norm $|\cdot|_n$,
and (c) replacing the functional $T_\theta$ by an approximating
(random) functional $T_\theta^{(n)}$. The computationally feasible
replacement of (1.1) takes the form

(1.6)  $\min_{\theta \in \Theta_n} n^{1/2} |\hat T_n - T_\theta^{(n)}|_n$.

Because of the possible infinite dimensionality of $\Theta$, the
choices of $|\cdot|_n$, $\Theta_n$ must be made with some care; issues
surrounding these choices are discussed in section 2.

2.  Stochastic  Methods. 

This section discusses issues surrounding the choice
of $\Theta_n$, $T_\theta^{(n)}$, and $|\cdot|_n$ in the formula (1.6).

(2.1). Search of $\Theta$. Henceforth, let us call the subset
$\Theta_n \subset \Theta$ used to construct (1.6) a search set for $\Theta$; if $\Theta_n$
is random, we call it a stochastic search set. Throughout
the rest of the paper, $\Theta$ is assumed to be a subset of a
normed space.

Search method (a): sieves on $\Theta$. Although the statistic
(1.1) is generally incomputable, it is often theoretically
intractable as well, because either the differentiability
hypothesis of standard minimum distance theory may fail
on $\Theta$, or else the non-singularity hypothesis fails; for
example, differentiability problems arise in the location
model and non-singularity fails for certain regression
models (cf. Millar 1982). (These concepts are defined in
section 3.) It may be possible, however, to find
(non-random) subsets $\Theta_n \uparrow \Theta$, such that these hypotheses hold
on $\Theta_n$ sufficiently well that an asymptotic analysis may
proceed, albeit with technical complications. The
difficulties are reminiscent of those found in maximum
likelihood estimation on infinite dimensional $\Theta$ (cf.
Grenander, 1981; Geman, Huang, 1982) where the $\Theta_n$ so
introduced are called "sieves".

The theoretical difficulties have a counterpart in sound
intuition: it is undesirable to "over fit" the model relative
to the data at hand. While sieve methods are of
considerable theoretical interest, they frequently leave, in the
situation of goodness of fit statistics like (1.1), a
computational problem as bad as the original one.

Search method (b): simple searches of $\Theta$. A different
modification of (1.1) is to replace $\inf_\Theta$ by a minimum
over a finite subset $\Theta_n \subset \Theta$.

How should a finite search set be chosen? The set $\Theta$
is infinite dimensional, perhaps bounded, but it may not
be precompact. In this case it would not be possible to
construct a finite $\epsilon$-grid over $\Theta$. Even if $\Theta$ were compact,
and thus there exists such a grid, actual construction of it
could be formidable, except in very special cases. For
example, construction of an $\epsilon$-grid over all the
probabilities on the unit ball of $R^d$ appears to be intractable; such
a difficulty can arise in both location and logistic models.
Finally, construction of $\epsilon$-grids is, in general, a much
more ambitious undertaking than the one required to
provide a decent approximation to $\inf_\Theta$.

Another suggestion might be to construct iid $\Theta$-valued
random variables $Y_1, \ldots, Y_{j_n}$, and take $\Theta_n$ to consist of
the values of this sample. Except in special cases, there
may be difficulty in carrying out this construction in a
computationally feasible way.

A more fundamental difficulty centres on the fact that
if $P_{\theta_0}$ is the actual data distribution, then typically
$\inf_{\theta \in \Theta} |\hat T_n - T_\theta|$ achieves its minimum within a ball of
radius $cn^{-1/2}$ about $\theta_0$ (see section 3). On the other hand,
it can be shown (Millar, 1988) that for many bounded,
infinite dimensional $\Theta$, the $Y_i$-search, $1 \le i \le j_n$, will in
general miss this crucial $n^{-1/2}$-ball with positive
probability, no matter how fast $j_n \uparrow \infty$. More precisely, for any
sequence $\{j_n\}$, there exists $\theta_0$ and a sequence $\{a_n\}$,
$a_n \gg n^{-1/2}$, such that $\lim P\{|Y_i - \theta_0| > a_n$ for some
$i \le j_n\} > 0$.

Search method (c): local stochastic search. A more
promising approach, which is justified by the result of
section 3, depends upon the fact that, if $\theta_0$ is the 'true'
parameter, then the minimizing point of $n^{1/2}|\hat T_n - T_\theta|$
occurs within a neighborhood of diameter $n^{-1/2}$ about $\theta_0$.
Thus search methods which stray outside such
neighborhoods waste time searching unimportant parts of $\Theta$. One
way to capitalize on this property is to suppose that there
exist estimates $\hat\theta_n = \hat\theta_n(x_n)$ of $\theta \in \Theta$, with values in $\Theta$,
such that, whenever $|\theta_n - \theta_0| \le cn^{-1/2}$, $\theta_n \in \Theta$:

(2.1)  $n^{1/2}|\hat\theta_n - \theta_n|$ is tight under $P_{\theta_n}$.

(We then call the estimators $\hat\theta_n$ "$n^{1/2}$ consistent".) In
many nonparametric applications, such estimates are
known: an example is given below. Next, let
$x_i^* = (x_{i1}^*, \ldots, x_{in}^*)$, $1 \le i \le j_n$, be $j_n$ independent
bootstrap samples of size n drawn from $P_{\hat\theta_n}$; cf. Efron,
1979. Define a random search set $\Theta_n$ by

(2.2)  $\Theta_n = \{\hat\theta_n(x_n), \hat\theta_n(x_1^*), \ldots, \hat\theta_n(x_{j_n}^*)\}$.

The search set (2.2) may be rewritten as $\Theta_n =
\{\hat\theta_n(x_n), \hat\theta_n(x_n) + Y_1^* n^{-1/2}, \ldots, \hat\theta_n(x_n) + Y_{j_n}^* n^{-1/2}\}$,
where the $Y_i^*$, $1 \le i \le j_n$, are conditionally iid, given $x_n$;
here $Y_i^* = n^{1/2}[\hat\theta_n(x_i^*) - \hat\theta_n(x_n)]$. In our applications, $\Theta$
is typically not open. Therefore, the structurally simpler
local search set $\{\hat\theta_n(x_n), \hat\theta_n(x_n) + Y_1 n^{-1/2}, \ldots,
\hat\theta_n(x_n) + Y_{j_n} n^{-1/2}\}$, where $Y_1, Y_2, \ldots, Y_{j_n}$ are iid, cannot
be used here since there is no guarantee that
$\hat\theta_n + Y_i n^{-1/2} \in \Theta$. Such difficulties arise in both the
location and logistic models. On the other hand, this
approach can be made to work in classical parametric
problems, provided $\Theta$ is open; see also J.-P. Kreis, 1987,
for another interesting case (involving time series) where
such an iid local search is quite effective. Obviously,
when the structurally simpler search just mentioned can
be justified, it might be preferred because of the greater
freedom in the method for simulating $Y_1, \ldots, Y_{j_n}$.
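
As a rough illustration of the search set (2.2), the sketch below treats the simplest possible case, which is an assumption made here for brevity and not one of the paper's nonparametric examples: real data, the model $N(\theta, 1)$, the sample mean as $\hat\theta_n$, and the Kolmogorov distance as the norm. The function names are hypothetical.

```python
import math
import numpy as np

def ks_to_normal(x_sorted, theta):
    """sup_t |F_n(t) - Phi(t - theta)| for sorted data without ties."""
    n = x_sorted.size
    f = np.array([0.5 * (1.0 + math.erf((t - theta) / math.sqrt(2.0)))
                  for t in x_sorted])
    upper = np.arange(1, n + 1) / n   # F_n at each order statistic
    lower = np.arange(0, n) / n       # left limit of F_n there
    return max(np.max(np.abs(upper - f)), np.max(np.abs(lower - f)))

def local_search_set(x, j, rng):
    """theta_hat(x) together with j bootstrap replicas theta_hat(x_i*)."""
    n = x.size
    theta_hat = x.mean()
    # Bootstrap samples are drawn from the fitted model N(theta_hat, 1).
    replicas = [rng.normal(theta_hat, 1.0, n).mean() for _ in range(j)]
    return np.array([theta_hat] + replicas)

def stochastic_fit_statistic(x, j, rng):
    """min over the search set of n^{1/2} times the Kolmogorov distance."""
    xs = np.sort(x)
    search = local_search_set(x, j, rng)
    return math.sqrt(x.size) * min(ks_to_normal(xs, th) for th in search)

rng = np.random.default_rng(1)
x = rng.normal(2.0, 1.0, 100)
print(stochastic_fit_statistic(x, j=20, rng=rng))
```

The replicas scatter around $\hat\theta_n$ at distance of order $n^{-1/2}$, so the search stays inside the neighborhood where the minimum is expected to occur.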

Example 2.1: Location model. What does the local
stochastic search amount to in this case? Here is one
possibility. Recall that here the parameter $\theta$ is $\theta = (\eta, F)$,
where $\eta$ is the center of symmetry and F is the
"symmetric" underlying distribution. An easy choice of
estimate $\hat\theta_n = (\hat\eta_n, \hat F_n)$, which satisfies the condition (2.1),
begins by taking $\hat\eta_n$ to be an $\alpha$-trimmed mean
(co-ordinatewise trimming will do). Next, let $H_n$ be the
empirical measure of the centered data
$x_1 - \hat\eta_n, \ldots, x_n - \hat\eta_n$, and let $\hat F_n$ be the
"symmetrization" of $H_n$. What $\hat F_n$ actually is depends on the exact
definition of symmetry, but in the two illustrations given
in Example 1.3, the definition of $\hat F_n$ is intuitively clear.
For example, in the isotropic case, $\hat F_n$ puts the uniform
distribution of weight $n^{-1}$ on each of the spherical shells
$\{x \in R^d: |x| = |x_i - \hat\eta_n|\}$, $1 \le i \le n$. The stochastic
search set $\Theta_n$ then consists of bootstrap replicas of
$\hat\theta_n = (\hat\eta_n, \hat F_n)$. This choice of $\hat\theta_n$ will be a $n^{1/2}$ consistent
estimator of $\theta = (\eta, F)$ with values in $\Theta$.
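
Drawing a bootstrap sample from the fitted isotropic model can be sketched as follows. This is a minimal illustration under the assumptions of the isotropic case only; the function names `trimmed_mean` and `sample_isotropic_fit` are hypothetical, and the trimming fraction is taken below 1/2.

```python
import numpy as np

def trimmed_mean(x, alpha):
    """Coordinatewise alpha-trimmed mean of the rows of x (alpha < 0.5)."""
    xs = np.sort(x, axis=0)
    n = xs.shape[0]
    g = int(np.floor(alpha * n))  # number trimmed from each end
    return xs[g:n - g].mean(axis=0)

def sample_isotropic_fit(x, eta_hat, size, rng):
    """Draw from the law of eta_hat + Z, Z distributed as the isotropic F_n."""
    radii = np.linalg.norm(x - eta_hat, axis=1)
    # Weight 1/n on each observed radius |x_i - eta_hat| ...
    r = rng.choice(radii, size=size, replace=True)
    # ... and a uniform direction on the shell of that radius.
    g = rng.standard_normal((size, x.shape[1]))
    u = g / np.linalg.norm(g, axis=1, keepdims=True)
    return eta_hat + r[:, None] * u

rng = np.random.default_rng(2)
x = rng.standard_normal((100, 3)) + np.array([1.0, -1.0, 0.5])
eta_hat = trimmed_mean(x, alpha=0.1)
x_star = sample_isotropic_fit(x, eta_hat, size=100, rng=rng)
```

Recomputing the trimmed mean and the symmetrized empirical measure on each such sample `x_star` gives one bootstrap replica of $\hat\theta_n$.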

(2.2). Stochastic norms. In this subsection we explain
some methods for replacing $|\cdot|$ in (1.1) by a computable
approximation $|\cdot|_n$, as in (1.6). Our procedure involves
the notion of a random norm, a concept that, it turns out,
has had a long history in statistical methods. For a
probability space X and a linear space B, a stochastic norm
$|\cdot|_n$ is a map from $X \times B$ to $R^1$ such that for each
$x \in X$, $b \to |b|_n(x)$ is a pseudo-norm on B. Here are
several examples of stochastic norms.

(2.2a).  Kolmogorov  distance. 

Let $x_n = (x_1, \ldots, x_n)$ be iid real random variables
with continuous c.d.f. F, and empirical cdf $F_n$. Let B
denote the space of bounded, real, right continuous
functions on $R^1$ with left limits. A stochastic norm $|\cdot|_n$ on B
is then given by
$|b|_n = \max\{\max_{i \le n}|b(x_i)|, \max_{i \le n}|b(x_i-)|\}$,
$b \in B$. The Kolmogorov distance between F, $F_n$ is then
given by the stochastic norm $|F - F_n|_n$.
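
A short sketch of this computation follows. It assumes, for illustration only, that F is the standard normal cdf and that the sample has no ties (which holds almost surely for continuous F); the function names are hypothetical.

```python
import math
import numpy as np

def normal_cdf(t):
    """Standard normal cdf, standing in for the continuous F."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

def kolmogorov_distance(x, cdf=normal_cdf):
    """|F - F_n|_n: maximum of |F_n - F| over data points and left limits."""
    xs = np.sort(np.asarray(x, dtype=float))
    n = xs.size
    f = np.array([cdf(t) for t in xs])
    fn_right = np.arange(1, n + 1) / n  # F_n(x_(i))
    fn_left = np.arange(0, n) / n       # F_n(x_(i)-)
    return max(np.max(np.abs(fn_right - f)), np.max(np.abs(fn_left - f)))

rng = np.random.default_rng(3)
print(kolmogorov_distance(rng.standard_normal(200)))
```

Because $F_n - F$ is monotone between consecutive data points, evaluating at the data points and their left limits recovers the full supremum over t.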

(2.2b). Cramer von Mises discrepancy. Let
$x_n = (x_1, \ldots, x_n)$, F, $F_n$ be as in subsection (2.2a). The
Cramer-von Mises discrepancy is
$\{\int [F_n(t) - F(t)]^2 dF(t)\}^{1/2}$. For fixed F this defines the
$L_2(F)$ norm on $(F_n - F)$. It is asymptotically equivalent
to $\{\int [F_n(t) - F(t)]^2 dF_n(t)\}^{1/2}$, a stochastic norm $|\cdot|_n$
on the difference $F_n - F$. The stochastic norm $|\cdot|_n$ is
given by $|f|_n^2 = \int f(t)^2 dF_n(t)$; that is, $L_2(F_n)$ replaces the
norm of $L_2(F)$.
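
Since $F_n$ puts mass $n^{-1}$ on each observation, the $L_2(F_n)$ norm reduces to a finite average. A minimal sketch, assuming no ties in the data and using a uniform cdf purely as an example (the function names are hypothetical):

```python
import math
import numpy as np

def uniform_cdf(t):
    """cdf of the uniform distribution on [0, 1] (example choice of F)."""
    return min(max(t, 0.0), 1.0)

def cvm_stochastic_norm(x, cdf):
    """|F_n - F|_n with |f|_n^2 = (1/n) * sum_i f(x_i)^2."""
    xs = np.sort(np.asarray(x, dtype=float))
    n = xs.size
    fn = np.arange(1, n + 1) / n           # F_n at the order statistics
    f = np.array([cdf(t) for t in xs])     # F at the same points
    return math.sqrt(np.mean((fn - f) ** 2))

rng = np.random.default_rng(4)
print(cvm_stochastic_norm(rng.uniform(size=200), uniform_cdf))
```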

(2.2c). Stochastic norms based on quadrants. For
$t \in R^d$, let K(t) denote the lower left "quadrant" with
corner at t: $K(t) = \{u \in R^d: u_i \le t_i, 1 \le i \le d\}$, where
$u = (u_1, \ldots, u_d)$, $t = (t_1, \ldots, t_d)$. Write P(t) for P(K(t)),
thus identifying P with an element of $B \equiv L_\infty(R^d)$. The
quadrant metric between two probabilities P, Q is then

$\sup_t |P(t) - Q(t)|$.

There are several ways to replace the quadrant metric
with a more computable approximation.

(i) Let $t = (t_1, \ldots, t_{k_n})$ be a vector of iid N(0, I) random
variables on $R^d$. A simple stochastic norm $|\cdot|_n$ on
$L_\infty(R^d)$ is then $|b|_n = \max_{i \le k_n} |b(t_i)|$, $b \in L_\infty$.

(ii) Let $x_1, \ldots, x_n$ be iid, $R^d$-valued, with empirical
measure $\hat P_n$. Let C be the smallest cube in $R^d$
containing the data points $x_1, \ldots, x_n$. Pave C with
$(p_n)^d$ cubes of equal size; let $n_i$ be the number of
data points in cube i; draw $(n_i/n)k_n$ points
uniformly from cube i. Then a stochastic norm on
$L_\infty(R^d)$ is given by the maximum of $|b(t)|$, $b \in L_\infty$,
over the points t just drawn. This data based
stochastic norm will more nearly approximate the $L_\infty$
norm of $\hat P_n - P$ than will the stochastic norm of (i)
above. On the other hand, it involves more
computation.
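
The construction in (ii) might be sketched as below. Several choices here are assumptions made for the illustration: the cube is anchored at the coordinatewise minimum of the data with side equal to the largest coordinate range, the allocation $(n_i/n)k_n$ is rounded up so every occupied cube receives at least one point, and b is taken to be the difference of two empirical measures. All function names are hypothetical.

```python
import numpy as np

def draw_quadrant_points(x, p, k, rng):
    """Draw corner points t from the p**d paving cubes, about k in total,
    allocated in proportion to the number of data points in each cube."""
    x = np.asarray(x, dtype=float)
    n, d = x.shape
    lo = x.min(axis=0)
    side = float((x.max(axis=0) - lo).max())
    side = side if side > 0 else 1.0
    width = side / p
    # Paving-cube index of every data point along every axis.
    idx = np.minimum(((x - lo) / width).astype(int), p - 1)
    cells, counts = np.unique(idx, axis=0, return_counts=True)
    points = []
    for cell, n_i in zip(cells, counts):
        m = int(np.ceil(n_i / n * k))  # rounded up (an assumption)
        points.append(lo + (cell + rng.uniform(size=(m, d))) * width)
    return np.vstack(points)

def quadrant_prob(sample, t):
    """Empirical mass of the quadrant K(t) for each row t."""
    inside = np.all(sample[None, :, :] <= t[:, None, :], axis=2)
    return inside.mean(axis=1)

def quadrant_stochastic_norm(b, points):
    """max of |b(t)| over the drawn points t."""
    return float(np.max(np.abs(b(points))))

rng = np.random.default_rng(5)
x = rng.standard_normal((200, 2))
y = rng.standard_normal((200, 2))
pts = draw_quadrant_points(x, p=4, k=50, rng=rng)
print(quadrant_stochastic_norm(
    lambda t: quadrant_prob(x, t) - quadrant_prob(y, t), pts))
```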

(2.2d). Stochastic norms on half-spaces. Let P be a
probability on $R^d$, and write P(s,t) = P(A(s,t)), for the
half-space A(s,t), as explained in (1.3), so that P is an
element of $L_\infty(S^d \times R^1)$. Let $s_1, \ldots, s_{k_n}$ be iid,
uniformly distributed on $S^d$, and $t_1, \ldots, t_{k_n}$ iid N(0, 1).
Then two possible stochastic norms on $B = L_\infty(S^d \times R^1)$
are: $\max_{i \le k_n} |b(s_i, t_i)|$, and $\max_{i \le k_n} \sup_t |b(s_i, t)|$, $b \in B$. If b is of
the form $b = \hat P_n - P$ where $\hat P_n$ is the empirical of n iid
observations from P, then the second of these two
stochastic norms is more data dependent and thus appears to
give a better approximation to the true norm. Data
dependent stochastic norms along the lines of the second
example under (2.2c) are also possible. Finding the
"best" data dependent stochastic norms for estimating
$|\hat P_n - P|$ for the half-space metric is currently an
interesting open problem. As regards simulation of $s_1, \ldots, s_{k_n}$
here, note that the following simple method works:
simulate $g_1, \ldots, g_{k_n}$ iid standard Gaussian rv's on $R^d$, and set
$s_i = g_i / |g_i|$ where $|g_i|$ is the Euclidean norm of $g_i$.
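
The simulation of the directions and the first of the two norms can be sketched as follows. To keep the example exact, it assumes P is the standard normal law on $R^d$, for which $P(A(s,t)) = \Phi(t)$ for every unit vector s; that choice, and the function names, are assumptions of the illustration.

```python
import math
import numpy as np

def uniform_directions(k, d, rng):
    """k iid directions, uniform on the unit shell of R^d: s_i = g_i / |g_i|."""
    g = rng.standard_normal((k, d))
    return g / np.linalg.norm(g, axis=1, keepdims=True)

def halfspace_point_norm(x, directions, thresholds):
    """max_i |P_n(A(s_i, t_i)) - Phi(t_i)| for P standard normal on R^d."""
    # Column i holds the indicator of x's_i <= t_i for every observation.
    emp = np.mean(x @ directions.T <= thresholds, axis=0)
    phi = np.array([0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))
                    for t in thresholds])
    return float(np.max(np.abs(emp - phi)))

rng = np.random.default_rng(6)
x = rng.standard_normal((300, 3))
s = uniform_directions(25, 3, rng)
t = rng.standard_normal(25)
print(halfspace_point_norm(x, s, t))
```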

2.3. Stochastic functionals and a generic form of the
test statistics. Evaluation of the test statistic (1.1) often
entails, as the examples of section 1 make clear,
evaluation of quantities such as $P_\theta(A)$ for various parameters $\theta$
and certain sets A, such as half-spaces and quadrants. As
in the Fisher example (cf. sec. 1) such an evaluation may
not be easy. Thus, one wishes to replace $P_\theta(A)$ by an
approximation $P_\theta^{(n)}(A)$, or more generally, the $T_\theta$ in (1.1)
by an approximation $T_\theta^{(n)}$. While the main result of
section 3 gives a very general approach to this
approximation problem, the special examples listed in section 1
have, in fact, involved only two kinds of approximation.
First, for given $\theta$, one may estimate $P_\theta(A)$ by drawing $h_n$
iid variables from $P_\theta$, obtaining an empirical measure
$P_\theta^{(n)}$ which provides the estimate $P_\theta^{(n)}(A)$. Such a method
has been used in the Fisher distribution problem described
in section 1 (cf. Beran and Millar, 1986). Of course, in
some cases it may be extremely difficult to carry out such
a simulation. In the location problem on $R^d$, the relevant
probabilities $P_\theta(A)$ can be represented as averages of
simple measures, over a certain Haar measure (which
depends on the notion of symmetry adopted). While such
an averaging is uncomputable, in general, replacing this
Haar measure by an appropriate empirical probability
yields an effective approximation. No doubt, in some
situations it may be also possible to replace $P_\theta(A)$ by an
analytic approximation. While there are a great many
statistics of the form (1.1), (1.6), the paradigm case
underlying the specific examples of section 1 is

(2.3)  $\min_{1 \le i \le j_n} \max_{1 \le j \le k_n} \sup_t n^{1/2} |\hat P_n(A(s_j,t)) - P^{(n)}_{\theta_i^*}(A(s_j,t))|$

where $\{s_j\}$, A(s,t) are given in (2.2d), $\{\theta_i^*\}$ is a
collection of $j_n$ bootstrap replicas of a preliminary
$n^{1/2}$-consistent estimator of $\theta$ (cf. search method (c)) and $P_\theta^{(n)}$
is, for each $\theta$, the empirical of $h_n$ iid random variables
drawn from $P_\theta$. Notice that this latter approximation
need be performed only for $\theta = \theta_i^*$, $1 \le i \le j_n$; thus the
Monte Carlo for the approximating functionals will
depend upon the outcome of the bootstrap samples which
determine the local search set for $\Theta$. As usual $\hat P_n$ in (2.3)
is the empirical measure of the data and $\{P_\theta\}$ is a
possibly nonparametric statistical model.
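
A compact sketch of a statistic of the form (2.3) is given below. For concreteness it assumes the parametric location model $N(\theta, I)$ on $R^d$ with the sample mean as preliminary estimator; this model, the sample sizes, and the function names are assumptions of the illustration, not the nonparametric cases treated in the paper.

```python
import numpy as np

def sup_over_t(proj_a, proj_b):
    """sup_t |F_a(t) - F_b(t)| for the empirical cdfs of two projected samples."""
    a, b = np.sort(proj_a), np.sort(proj_b)
    grid = np.concatenate([a, b])  # the difference only changes at these points
    fa = np.searchsorted(a, grid, side="right") / a.size
    fb = np.searchsorted(b, grid, side="right") / b.size
    return float(np.max(np.abs(fa - fb)))

def normal_location_setup(x, j, h, k, rng):
    """Directions s_j, bootstrap replicas theta_i*, and Monte Carlo draws from P_theta_i*."""
    n, d = x.shape
    g = rng.standard_normal((k, d))
    directions = g / np.linalg.norm(g, axis=1, keepdims=True)
    theta_hat = x.mean(axis=0)
    thetas = [theta_hat] + [rng.normal(theta_hat, 1.0, (n, d)).mean(axis=0)
                            for _ in range(j)]
    model_draws = [rng.normal(th, 1.0, (h, d)) for th in thetas]
    return directions, thetas, model_draws

def paradigm_statistic(x, model_draws, directions):
    """min over theta_i*, max over s_j, sup over t, scaled by n^{1/2}."""
    best = min(max(sup_over_t(x @ s, z @ s) for s in directions)
               for z in model_draws)
    return float(np.sqrt(x.shape[0]) * best)

rng = np.random.default_rng(7)
x = rng.standard_normal((100, 2))
directions, thetas, model_draws = normal_location_setup(x, j=5, h=400, k=10, rng=rng)
print(paradigm_statistic(x, model_draws, directions))
```

The supremum over t is exact here because both measures are empirical, so only the pooled projected points need to be examined.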

(2.4). Variable $\Theta_n$ vis a vis variable $|\cdot|_n$. The test
statistics suggested by (1.6) have the form $\min_{\theta \in \Theta_n} |A_\theta|_n$,
where $\Theta_n$ is a variable subset of $\Theta$ and $|\cdot|_n$ is a pseudo
norm depending also on n. Even if $\Theta_n \uparrow \Theta$ and
$|\cdot|_n \uparrow |\cdot|$ (this last denoting a norm), one cannot in
general hope for convergence of $\min_{\theta \in \Theta_n} |A_\theta|_n$, no matter
how smooth $A_\theta$ may be. The difficulty can often be
traced to the infinite dimensionality of $\Theta$: analogous
objects in the finite dimensional case typically converge
(Beran and Millar, 1988a).

Here  is  a  simple  illustration  of  the  difficulty. 

Example. Let B be the linear space of all real sequences
$b = (b_1, b_2, \ldots)$ such that $b_k = 0$ for all sufficiently large k.
For $a, b \in B$ define $\langle a, b \rangle = \sum_i a_i b_i$. Let $e_i$ denote the
member of B with 1 on the i-th co-ordinate and 0
elsewhere. Define $|b|_n = \max_{i \le k_n} |\langle b, e_i \rangle|$ and
$\Theta_n = \{e_1, \ldots, e_{j_n}\}$. Then $\min_{\theta \in \Theta_n} |\theta|_n = 0$ if $j_n > k_n$,
and $= 1$ if $k_n \ge j_n$;
thus, without conditions on $k_n$ there can be no
convergence of $\min_{\theta \in \Theta_n} |\theta|_n$ as $n \to \infty$.
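
The dichotomy in this example is easy to verify directly. The sketch below is a minimal check, with j and k standing for $j_n$ and $k_n$; the function name is hypothetical.

```python
def min_stochastic_norm(j, k):
    """min over theta in {e_1, ..., e_j} of max_{i <= k} |<theta, e_i>|."""
    # <e_m, e_i> is 1 when m == i and 0 otherwise, so the pseudo-norm of
    # e_m is 1 when m <= k and 0 when m > k.
    return min(1 if m <= k else 0 for m in range(1, j + 1))

for j, k in [(5, 3), (3, 5), (4, 4)]:
    print(j, k, min_stochastic_norm(j, k))
```

If $j_n$ and $k_n$ alternate in which is larger, the minimum oscillates between 0 and 1 and so cannot converge.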

Despite the simplicity of this example, its basic moral
carries over: if convergence of $\min_{\theta \in \Theta_n} |A_\theta|_n$ is desired, then
the "size" of the search set $\Theta_n$ should not be "too
sophisticated" for the norm $|\cdot|_n$. Intuitively, one should
not "over fit" the model relative to the measure of
discrepancy $|\cdot|_n$. In particular, applications will often
therefore require conditions on the relative size of the
search set $\Theta_n$ and the sample size determining the
stochastic norm $|\cdot|_n$.

2.5. Critical values. Critical values for test statistics
(1.6) can often be obtained by bootstrap methods. There
are several valid ways to do this. A technique which
entails significant computational savings is a "conditional
bootstrap method", which for convenience we describe
only for the paradigm statistic (2.3).

First, fix the random variables $(s_j, 1 \le j \le k_n)$ which
determine the stochastic norm; fix the random variables
$(\theta_i^*, 1 \le i \le j_n)$ which determine the local stochastic
search of $\Theta$, and fix the simulated estimates $\hat P_{\theta_i^*}$. Next,
draw $m_n$ bootstrap samples $u_1^*, \ldots, u_{m_n}^*$, each of size $n$,
from the fitted model $P_{\hat\theta_n}$; here $\hat\theta_n$ is the same
$n^{1/2}$-consistent estimator used to generate the search set for $\Theta$.
It is assumed that the $u_l^*$ are conditionally independent
(given $x_n$) of the bootstrap samples used to construct $\theta_i^*$,
and of the random variables used to construct $\hat P_{\theta_i^*}$, and
independent of $(s_j)$. Let $\hat P_n(u_l^*; \cdot)$ denote the empirical
measure of $u_l^*$, and let $G_n^*$ denote the empirical cdf of

$$\min_i \max_j \sup_t \, n^{1/2} \, |\hat P_n(u_l^*; A(s_j,t)) - \hat P_{\theta_i^*}(A(s_j,t))|, \qquad 1 \le l \le m_n.$$

Then under suitable conditions the quantiles
of $G_n^*$ give asymptotically valid critical values for the
stochastic test statistic (2.3). A proof can be based on the
asymptotic representation theorem in section 3, together
with techniques developed in Beran and Millar, 1987.
Other valid bootstrap methods could involve recalculating
either (or both) the search set $\{\theta_i^*\}$ or the $(s_j)$ for each
bootstrap sample $u_l^*$, $1 \le l \le m_n$. The added computational
burden is enormous and, up to first order asymptotics,
there is no gain over the "conditional" method.
Whether or not there is any compensating extra "stability"
in such methods is an interesting open question that
requires second order asymptotic analysis.
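
The conditional bootstrap just described can be sketched in a one-dimensional toy setting. Everything specific below is an assumption made for illustration and is not taken from the text: the model is $N(\theta, 1)$, the sets $A(s_j, t)$ are replaced by half-lines $(-\infty, s_j]$, the search set consists of bootstrap replicas of the sample mean, and all function names are hypothetical.

```python
import bisect
import math
import random


def ecdf_at(sample, points):
    """Empirical cdf of `sample` evaluated at each of `points`."""
    ordered = sorted(sample)
    return [bisect.bisect_right(ordered, p) / len(ordered) for p in points]


def statistic(sample, points, fitted_cdfs):
    """min over search points of max over norm points of sqrt(n)|F_n - F_theta|."""
    n = len(sample)
    f_n = ecdf_at(sample, points)
    return min(
        max(math.sqrt(n) * abs(a - b) for a, b in zip(f_n, f_theta))
        for f_theta in fitted_cdfs
    )


def conditional_bootstrap(x, j_n=20, k_n=15, m_n=200, n_sim=500,
                          level=0.95, seed=0):
    rng = random.Random(seed)
    n = len(x)
    theta_hat = sum(x) / n  # root-n consistent estimator in the toy model
    # Fixed once: points s_j that determine the stochastic norm.
    points = [rng.choice(x) for _ in range(k_n)]
    # Fixed once: local stochastic search set (bootstrap replicas of theta_hat).
    search = [sum(rng.choices(x, k=n)) / n for _ in range(j_n)]
    # Fixed once: simulated estimates of the model cdf at each search point.
    fitted = [ecdf_at([rng.gauss(t, 1.0) for _ in range(n_sim)], points)
              for t in search]
    observed = statistic(x, points, fitted)
    # Only the bootstrap samples from the fitted model are redrawn.
    boot = sorted(
        statistic([rng.gauss(theta_hat, 1.0) for _ in range(n)], points, fitted)
        for _ in range(m_n)
    )
    critical = boot[min(m_n - 1, math.ceil(level * m_n) - 1)]
    return observed, critical, boot
```

The saving comes from reusing `points`, `search` and `fitted` across all `m_n` bootstrap samples; the alternative methods mentioned above would rebuild some or all of them inside the loop.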
3. Asymptotic representation theorem.

This section establishes, under suitable hypotheses, the
asymptotic form of test statistics of the form (1.6). The
formulation is sufficiently general so that the result
applies to statistics based on sieves (cf. section 2), to the
stochastic statistics illustrated by (2.3), as well as to a
number of other possibilities. It can be used to show
under supplementary hypotheses that approximations like
(1.6) have an asymptotic form similar to that of (1.1).
The triangular array formulation makes the result a
convenient tool in establishing the asymptotic validity of
bootstrap methods in the calculation of critical values (cf.
subsection 2.5). Motivation for the particular formulation
adopted here comes from the "paradigm" statistic
(2.3), the logistic testing problem (cf. section 1) which is
not of the form given in the paradigm, and the possibility
of extending sieve methods from an MLE framework to a
"minimum distance" framework.

Let $X$ be a measure space, and $x_n = (x_1, \ldots, x_n)$ a
vector of $X$-valued random variables having joint distribution
$P_n$. Note that the general formulation does not
require that the $x_i$ be independent. Let $\Theta$ be a subset of a
normed linear space $B_1$, and let $\Theta_n$, $n = 1, 2, \ldots$ be subsets
of $\Theta$. The subsets $\Theta_n$ are allowed to be random. For
each $\theta \in \Theta$, let $T_\theta$ be a functional defined on $X$ (or on
the $n$-fold product of $X$) with values in a possibly
different normed space $B_2$. Let $\hat T_n = \hat T_n(x_n)$ be a
$B_2$-valued statistic on $X^n$.

Let $|\cdot|_n$ be a (possibly random) pseudonorm on $B_2$.
For each $\theta$, $n$, let $T_\theta^n$ be a $B_2$-valued functional on $\Theta$, also
possibly random. In many applications $T_\theta^n$ is an easily
computable approximation to $T_\theta$, based on Monte Carlo
simulations. The construction of $\Theta_n$, $|\cdot|_n$, $T_\theta^n$ may
involve certain auxiliary randomization. Let $Q_n$ be the
probability governing the distribution of $x_n$ as well as
these constructions.

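Fix $\theta_0 \in \Theta$. The hypotheses are as follows.

(3.1). Identifiability. For each $\epsilon > 0$, $c > 0$ there exists
$\delta > 0$ such that

$$\lim_{n\to\infty} Q_n\Big\{ \inf_{\theta \in \Theta_n,\ |\theta - \theta_0| > c} |T_\theta - T_{\theta_0}|_n \ge \delta \Big\} \ge 1 - \epsilon.$$

(3.2). Differentiability. There is a continuous linear
map $l$: span $\Theta \to B_2$ such that, for every $\epsilon > 0$, there
exists $\delta$ such that

$$\lim_{n\to\infty} Q_n\Big\{ \sup_{|\theta - \theta_0| \le \delta,\ \theta \in \Theta_n} |T_\theta - T_{\theta_0} - l(\theta - \theta_0)|_n \,/\, |\theta - \theta_0| < \epsilon \Big\} = 1.$$

(3.3). Non-singularity.

$$|l(\theta - \theta_0)|_n \ge C_n |\theta - \theta_0| \quad \forall\, \theta \in \Theta_n$$

where $C_n^{-1}$ is a tight sequence under $Q_n$.

(3.4). Consistency of $\hat T_n$: $n^{1/2} |\hat T_n - T_{\theta_0}|_n$ is tight under
$Q_n$.

(3.5). Approximation property:

$$\sup_{\theta \in \Theta_n} |T_\theta^n - T_\theta|_n \, n^{1/2} \to 0$$

in $Q_n$ probability.

(3.6). Proximity of $\theta_0$:

$$\inf_{\theta \in \Theta_n} n^{1/2} |T_\theta - T_{\theta_0}|_n \equiv A_n \ \text{ is } Q_n \text{ tight.}$$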

Condition (3.5) says that $T_\theta^n$ approximates $T_\theta$ in a
suitable fashion. In the case where $T_\theta^n$ is an empirical
measure obtained by simulating iid observations from a
probability $P_\theta \equiv T_\theta$, familiar exponential bounds on the
empirical process (cf., e.g., Alexander, 1984) quickly yield
simple conditions on the size of $\Theta_n$ vis a vis the number
of simulations used to construct $T_\theta^n$. See Beran and Millar,
1988a for a simple illustration. If no approximation
$T_\theta^n$ to $T_\theta$ is needed, as in the case of certain problems
involving multivariate normal distributions, then of course
(3.5) is automatically satisfied. The proximity condition
(3.6) ensures that the point $\theta_0$ is not far from $\Theta_n$. If $\Theta_n$
consists of bootstrap replicas of a preliminary estimate,
and if $\theta_0$ is the 'true' parameter, then (3.6) is automatic.
In case $\Theta_n$ is a sieve, (3.6) imposes conditions on the
speed with which $\Theta_n$ exhausts $\Theta$. The other conditions
are $n$-dependent variants of familiar conditions from the
theory of minimum distance estimators. Roughly speaking,
the effect of (3.1) is to ensure that the minimizing
$\theta$-points for (1.6) can be found eventually (as $n \to \infty$) in
any given "ball" about the "true" parameter; (3.2), (3.3)
ensure that said "ball" has a diameter of order $n^{-1/2}$. It is
considerations such as these that suggest the efficiency of
the local asymptotic search of $\Theta$ described in section 2.
Theorem 1. Assume (3.1) - (3.6). Let $\theta_n$ be any
sequence such that $n^{1/2}(\theta_n - \theta_0)$ is norm bounded in $B_1$.
Let $W_n = n^{1/2} [\hat T_n - T_{\theta_n}]$. Then under $Q_n$

$$\inf_{\theta \in \Theta_n} n^{1/2} |\hat T_n - T_\theta^n|_n = \inf_{\theta \in \Theta_n} |W_n - l(n^{1/2}(\theta - \theta_n))|_n + o_{Q_n}(1).$$

A novelty of this formulation is that $T_\theta^n$, $\Theta_n$, $|\cdot|_n$ all
depend on $n$ and can be random. Moreover, the parameter
set $\Theta_n$ is not assumed open, unlike classical developments.
Nevertheless, despite these novelties, the proof
can be accomplished by a somewhat complicated extension
of the methods of Wolfowitz (1953), Pollard (1980),
Bolthausen (1977), Millar (1985) and others.

Remark. (a) The identifiability condition can be
replaced by:

$$Q_n\{\exists \text{ at least one } \theta \in \Theta_n \text{ such that } |\theta - \theta_0| > c\} \to 0$$

as $n \to \infty$. When $\Theta_n$ is a local stochastic search as in
Section 2, it is often quite easy to write down an analytic
condition for the above convergence.

(b) Differentiability may be replaced by the following
"asymptotic" differentiability condition, which will be
employed elsewhere. There exist constants $K_1$, $K_2$ such
that

$$|T_\theta - T_{\theta_0} - l(\theta - \theta_0)|_n \le K_1 n^{-\delta} |\theta - \theta_0| + K_2 n^{\alpha} |\theta - \theta_0|^2 + o_p(|\theta - \theta_0|)$$

where $\delta > 0$ and $\alpha < 1/2$.

(c) The derivative $l$ can depend on $n$, provided $l_n$
replaces $l$ everywhere in the complete theorem statement.

(d) In many applications $W_n$, defined in the theorem
statement, converges to $W$, $\{n^{1/2}(\theta - \theta_n): \theta \in \Theta_n\}$ more
or less approximates span $\Theta$, and $|\cdot|_n \to |\cdot|$, as
$n \to \infty$. One therefore expects that the left side in
theorem 1 converges to $\inf_{\delta \in \text{span}\,\Theta} |W - l(\delta)|$, which is the
classical form of the limit. In particular applications this
convergence can indeed be established; however, as the
example in subsection (2.4) makes clear, one cannot, in
the generality considered here, expect such convergence
to happen, without supplementary conditions which may
regulate the size of $\Theta_n$ and the strength of $|\cdot|_n$.


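Proof. Because of the approximation property (3.5), it
suffices to analyse $\inf_{\theta \in \Theta_n} n^{1/2} |\hat T_n - T_\theta|_n$. Since

$$\inf_{|\theta - \theta_0| > c} |\hat T_n - T_\theta|_n \ge \inf_{|\theta - \theta_0| > c} |T_{\theta_0} - T_\theta|_n - |W_n|_n n^{-1/2},$$

hypotheses (3.1), (3.4) show that, for any $c$,
$\inf_{\theta \in \Theta_n,\ |\theta - \theta_0| > c} |\hat T_n - T_\theta|_n$ remains bounded away from 0 in
probability. On the other hand, $|\hat T_n - T_{\theta_0}|_n \to 0$; this,
together with (3.6) shows that
$\inf_{\theta \in \Theta_n,\ |\theta - \theta_0| \le c} |\hat T_n - T_\theta|_n \to 0$,
and so

$$\inf_{\theta \in \Theta_n} |\hat T_n - T_\theta|_n = \inf_{\theta \in \Theta_n,\ |\theta - \theta_0| \le c} |\hat T_n - T_\theta|_n$$

for any preselected $c > 0$.

Next let $d_n = (\max\{A_n, 1\}) [\, |W_n|_n \vee 1 \,] C_n^{-1}$ where
$A_n$, $C_n$ come from (3.6), (3.3). Then $(d_n)$ is tight.
Without loss of generality, assume $C_n \le \frac{1}{4}(\|l\| + 1)^{-1}$.
Let $N_n$ be the set of $\theta \in \Theta_n$ such that
$|T_\theta - T_{\theta_0} - l(\theta - \theta_0)|_n \le \frac{1}{2} C_n |\theta - \theta_0|$. By (3.2), (3.3)
and (3.6), $N_n$ is nonempty, and by (3.3), the preceding
paragraph, and an elementary argument,
$\inf_{\theta \in \Theta_n} |\hat T_n - T_\theta|_n = \inf_{\theta \in N_n} |\hat T_n - T_\theta|_n$.
On the other hand, if $\theta \in N_n$,

$$|\hat T_n - T_\theta|_n \ge |l(\theta - \theta_0)|_n - |\hat T_n - T_{\theta_0}|_n - \tfrac{1}{2} C_n |\theta - \theta_0| \ge \tfrac{1}{2} C_n |\theta - \theta_0| - n^{-1/2} |W_n|_n.$$

Since $N_n \supset \{\theta \in \Theta_n: n^{1/2} |\theta - \theta_0| \le d_n\}$, which is nonempty by (3.6) and the
definition of $d_n$, the calculation in the preceding sentence
implies

$$\inf_{\theta \in N_n} n^{1/2} |\hat T_n - T_\theta|_n = \inf_{\theta \in \Theta_n,\ n^{1/2}|\theta - \theta_0| \le d_n} |\hat T_n - T_\theta|_n \, n^{1/2} = \inf_{\theta \in \Theta_n,\ n^{1/2}|\theta - \theta_0| \le d_n} |W_n - l((\theta - \theta_n) n^{1/2})|_n + o_{Q_n}(1).$$

Next, note that, with $Q_n$ probability approaching 1,

$$\inf_{\theta \in \Theta_n,\ n^{1/2}|\theta - \theta_0| \le d_n} |W_n - l(\theta - \theta_n) n^{1/2}|_n \le (\|l\| + 1) d_n.$$

On the other hand, by definition of $d_n$, if $n^{1/2}|\theta - \theta_0| > d_n$,
then, since $C_n \le \frac{1}{4}(\|l\| + 1)^{-1}$:

$$|W_n - l(n^{1/2}(\theta - \theta_0))|_n \ge |l(n^{1/2}(\theta - \theta_0))|_n - |W_n|_n \ge C_n^{-1} d_n - d_n \ge 4(\|l\| + 1) d_n - d_n \ge 3(\|l\| + 1) d_n$$

so that

$$\inf_{\theta \in \Theta_n,\ n^{1/2}|\theta - \theta_0| \le d_n} |W_n - l(n^{1/2}(\theta - \theta_0))|_n = \inf_{\theta \in \Theta_n} |W_n - l(n^{1/2}(\theta - \theta_0))|_n, \quad \text{q.e.d.}$$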
BOOTSTRAP INFERENCE FOR REPLICATED EXPERIMENTS


Walter  Liggett,  National  Bureau  of  Standards,  Gaithersburg,  MD  20899 


ABSTRACT 

Inference  methods  valid  for  nonnormal  error  are 
proposed  for  experiments  in  which  each  design 
point  is  replicated  three  or  more  times. 
Differences  between  the  replicates  provide  the 
data  needed  for  a  pooled  estimate  of  the  error 
density,  and  this  density  forms  the  basis  for  the 
bootstrap.  The  density  estimator  is  specified  for 
symmetric  error  in  Liggett  (Biometrika,  1988),  and 
this  symmetric  estimator  has  been  generalized  to 
asymmetric  error.  In  this  paper,  the  application 
of  this  density  estimator  to  designed  experiments 
is  considered.  The  lack-of-fit  test  is  of 
particular  interest.  The  extension  of  the  density 
estimator  to  data  requiring  a  blocking  variable 
and  to  data  with  dispersion  effects  is  discussed. 
The  bootstrap  based  on  this  density  estimator  is 
shown  to  be  valid  for  smaller  sample  sizes  when 
the  test  statistics  are  robust.  Estimation  of  the 
error  density  is  illustrated  with  measurements 
replicated  at  different  laboratories. 

1.  INTRODUCTION 

In  industrial  experimentation,  when  the  error 
properties  are  crucial  to  the  inferences  drawn, 
the possibility of nonnormal experimental error
must  be  considered.  One  source  of  variability 
that  is  potentially  nonnormal  is  the  inhomogeneity 
of  physical  samples  of  bulk  materials.  For 
example,  trace  concentrations  of  a  particulate 
substance  in  portions  of  a  bulk  material  are 
usually  nonnormal.  Another  potentially  nonnormal 
source  is  material  degradation  that  accelerates  as 
it  proceeds.  Many  corrosion  processes  have  this 
property  as  does  spoilage  due  to  bacterial  growth. 
Other potentially nonnormal sources are inherent
in  measurement  procedures.  Examples  include  loss 
of  analyte  during  preparation  of  the  physical 
sample,  interfering  peaks  in  spectra  and 
chromatograms,  aberrant  results  from  the  software 
that  automatically  locates  peaks  and  measures 
their  height  or  area,  and  inconsistent  control  of 
variation  due  to  poor  understanding  of  the 
sensitivities  of  the  measurement  procedure.  This 
paper  discusses  a  bootstrap  method  for  obtaining 
valid  inferences  when  the  experimental  error  is 
nonnormal.

An  approach  to  bootstrap  inferences  for 
replicated  experiments  is  provided  by  the  pooled 
error  density  estimator  given  by  Liggett  (1988, 
1989).  This  estimator  is  based  on  the  assumption 
that  replicate  measurements  involve  independent 
and  identically-distributed  realizations  of  the 
measurement  error,  the  usual  assumption  in 
designed  experiments.  This  assumption  leads  to  a 
relationship  between  the  error  density  and  the 
densities  of  the  first  and  second  differences 
between  error  realizations.  These  latter 
densities  can  be  estimated  from  differences 
between  replicates.  The  computation  of  the  error 
density  is  a  fitting  by  weighted,  nonlinear  least 
squares. 

Bootstrap  inferences  in  regression  can  be  based 
on  other  density  estimators  (Efron,  1982;  Efron 
and  Tibshirani,  1986).  When  each  design  point  is 
replicated  three  or  more  times,  a  separate  density 
estimate for each design point is an alternative
to a pooled density estimate. In this case, each
bootstrap  repetition  is  obtained  by  separately 
sampling  with  replacement  the  measurements  at  each 
design  point  (Efron,  1982).  When  the  number  of 
replicates  at  each  design  point  is  small,  this 
approach  suffers  from  the  discrepancies  between 
the  small-sample  empirical  distributions  and  the 
true  error  distributions.  If  a  pooled  density 
estimator  can  be  justified,  other  pooled 
estimators  might  be  chosen  in  place  of  the 
replicate-differences  density  estimator  in  Liggett 
(1988,  1989).  A  model  of  the  true  values  at  the 
design  points  can  be  fit  to  the  data,  and  the 
residuals  can  be  computed  and  combined  to  form  a 
density  estimate.  These  residuals  might  be 
obtained  from  separate  location  estimates  for  each 
design  point  or  from  a  more  restrictive  model  of 
the  regression  function.  When  the  number  of 
replicates  is  small,  the  location  estimates  for 
the  design  points  are  unstable,  and  the  naive 
combination  of  residuals  does  not  provide  a 
completely  adequate  density  estimate.  The 
combination  of  residuals  from  a  more  restrictive 
regression  model  leads  to  an  error  density  that 
depends  on  the  design  matrix.  Accounting  for  this 
dependence  in  a  lack-of-fit  test  seems  difficult. 
The replicate-differences density estimator does
not  suffer  from  these  problems  and  thus  seems 
attractive. 

Because  of  the  possibility  of  nonnormal  error, 
robust  statistics  are  the  proper  choice  for  the 
desired  inferences  (Hampel,  et  al.,  1986).  The 
use  of  the  bootstrap  to  find  the  distribution  of 
robust  statistics  is  the  focus  of  our  discussion. 
Thus,  our  interest  is  in  robustness  of  validity 
for  statistics  with  good  robustness  of  efficiency. 
Hampel,  et  al.  (1986)  offer  robust  tests  for 
linear  models  based  on  the  asymptotic  distribution 
of  the  test  statistics.  We  propose  to  use  the 
same  test  statistics  but  to  replace  the  asymptotic 
distribution  with  the  bootstrap. 

Designs  with  three  or  more  replicates  at  each 
design  point  have  recently  been  recommended  for 
applications  in  which  dispersion  effects  are 
potentially  important  (Box,  1988  and  the 
discussion).  These  designs  are  also  appropriate 
for  the  density  estimates  given  in  Liggett  (1988, 
1989). The recommendation of designs with three
or  more  replicates  at  each  design  point  is  a 
considerable  change  from  the  usual  recommendation. 
For  example,  the  number  of  replicates  recommended 
for  lack-of-fit  tests  may  be  as  small  as  5,  a 
number  too  small  for  investigation  of  either 
dispersion effects or nonnormality. When the
experimental  error  is  dominated  by  a  single  error 
source,  the  error  is  often  both  nonnormal  and  has 
a  variance  that  depends  on  the  controllable 
factors  in  the  experiment.  Thus,  dispersion 
effects  must  be  considered  when  nonnormality  is 
important,  and  conversely,  nonnormality  must  be 
considered  when  dispersion  effects  are  important. 
The  inclusion  of  both  nonnormality  and  dispersion 
effects  in  the  analysis  is  often  needed. 

In  this  paper,  three  aspects  of  the  application 
of  the  replicate-differences  density  estimator  to 
designed  experiments  are  considered.  The  first, 
which  is  discussed  in  Section  2,  is  the  extension 
of the density estimation method to the case in
which the replicates of the experiment require a
blocking variable and to the case in which
dispersion effects are present. The second, which
we discuss in Section 3, is the effect of the
choice  of  test  statistic  and  experimental  design 
on  the  validity  of  the  bootstrap  when  the  sample 
sizes  are  moderate.  The  third,  which  we  discuss 
in  Section  4,  is  the  performance  of  the  density 
estimator  on  a  set  of  measurements  with  a  variety 
of  real-world  imperfections. 

2.  DENSITY  ESTIMATION 

The  model  on  which  this  paper  is  based  differs 
from  the  usual  model  for  designed  experiments  only 
in  the  omission  of  the  assumption  of  normality. 

The $j$th replicate measurement at the $k$th design
point is given by

$$y_{jk} = x_k^T \theta + \epsilon_{jk} \quad (j = 1, \ldots, r_k;\ k = 1, \ldots, K) \qquad (1)$$

where $x_k^T$ is the row of the design matrix
corresponding to the $k$th design point, $\theta$ is the
vector of unknown parameters, $\epsilon_{jk}$ is the zero
mean, independent and identically-distributed
error, $r_k$ is the number of replicates at the $k$th
design point, and $K$ is the number of design
points.

As specified in Liggett (1988, 1989), the
replicate-differences estimator of the density of
$\epsilon_{jk}$ is based on the Hermite function expansion
(Schwartz, 1967). Note that this orthogonal
function expansion is different from the Edgeworth
expansion. The Hermite functions can be defined
by the recursion

$$\phi_0(x) = \pi^{-1/4} \exp(-x^2/2)$$

$$\phi_1(x) = 2^{1/2} \pi^{-1/4} x \exp(-x^2/2) \qquad (2)$$

$$\phi_q(x) = (2/q)^{1/2} x \, \phi_{q-1}(x) - \{(q-1)/q\}^{1/2} \phi_{q-2}(x).$$

To apply the Hermite function expansion to the
measurements, we use a scale factor computed from
the median of the absolute differences between
replicate measurements

$$s_r = (0.6745)^{-1} (2)^{-1/2} \, \mathrm{median}\, |y_{jk} - y_{j'k}| \quad (1 \le j < j' \le r_k;\ k = 1, \ldots, K). \qquad (3)$$

Division by 0.6745 and $\sqrt{2}$ makes $s_r$ an unbiased
estimate of the error standard deviation in the
normal case. We estimate the error density by
fitting the functional form for the density given
by

$$p(x) = (1/s_r) \sum_{q=0}^{Q} a_q \, \phi_q\{(x + \alpha)/s_r\}, \qquad (4)$$

where the parameter $\alpha$ is chosen so that the mean
of $p$ is zero. If the error density is assumed to
be symmetric as in Liggett (1988), then $\alpha = 0$ and
only Hermite functions of even order are needed in
the expansion.

The error density is estimated through its
relation to the densities of the first and second
differences between replicate measurements at the
same design point. The estimates of these
densities on which the fitting is based are given
by

$$p_d(x) = (\sqrt{2} s_r)^{-1} \sum_{q=0} b_{2q} \, \phi_{2q}\{x/(\sqrt{2} s_r)\}, \qquad (5)$$

$$b_{2q} = \Big\{ \sum_{k=1}^{K} (r_k - 1) \Big\}^{-1} \sum_{k=1}^{K} \frac{2}{r_k} \sum_{j > j'} \phi_{2q}\{(y_{jk} - y_{j'k})/(\sqrt{2} s_r)\}, \qquad (6)$$

$$p_c(x) = (\sqrt{6} s_r)^{-1} \sum_{q=0} c_q \, \phi_q\{x/(\sqrt{6} s_r)\}, \qquad (7)$$

$$c_q = \Big\{ \sum_{k=1}^{K} \cdots \Big\}^{-1} \sum_{k=1}^{K} \frac{\cdots}{r_k (r_k - 2)} \sum_{h} \sum_{j > j'} \phi_q\{(2 y_{hk} - y_{jk} - y_{j'k})/(\sqrt{6} s_r)\}. \qquad (8)$$

Equations  (6)  and  (8)  show  explicitly  how  the 
replicate  differences  enter  the  error  density 
estimation.  As  specified  in  Liggett  (1988,  1989), 
the  fitting  is  accomplished  by  a  weighted 
nonlinear  least  squares  algorithm.  Approaches  to 
avoiding  negative  values  of  the  density  estimate 
are  presented  in  Liggett  (1989). 
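
As an illustration of two ingredients of this section, the following sketch evaluates the Hermite functions by the recursion (2) and the scale factor (3). It is not the fitting algorithm of Liggett (1988, 1989); the function names and the list-of-lists data layout (one inner list of replicates per design point) are assumptions made for the example.

```python
import math
import statistics


def hermite_functions(x, q_max):
    """Return [phi_0(x), ..., phi_{q_max}(x)] using the recursion (2)."""
    phi = [math.pi ** -0.25 * math.exp(-x * x / 2.0)]
    if q_max >= 1:
        phi.append(math.sqrt(2.0) * x * phi[0])
    for q in range(2, q_max + 1):
        phi.append(math.sqrt(2.0 / q) * x * phi[q - 1]
                   - math.sqrt((q - 1.0) / q) * phi[q - 2])
    return phi


def scale_factor(groups):
    """s_r of (3): scaled median of absolute within-point replicate differences."""
    diffs = [abs(a - b)
             for g in groups
             for i, a in enumerate(g)
             for b in g[i + 1:]]
    return statistics.median(diffs) / (0.6745 * math.sqrt(2.0))


def inner_product(p, q, lo=-12.0, hi=12.0, steps=6000):
    """Trapezoidal approximation of the integral of phi_p * phi_q."""
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps + 1):
        phi = hermite_functions(lo + i * h, max(p, q))
        weight = 0.5 if i in (0, steps) else 1.0
        total += weight * phi[p] * phi[q]
    return total * h
```

The helper `inner_product` is included only as a numerical check that the functions produced by the recursion are orthonormal, which is the property the expansion (4) relies on.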


[Plot omitted; axis label: Estimated Quantiles]

Figure 1. Error density estimated from Quinlan's
cable-shrinkage experiment (Box, 1988).

An  experiment  with  4  replicates  at  each  of  16 
design  points  was  performed  by  Quinlan  (Box, 
1988).  Quinlan  included  many  replicates  to 
facilitate  the  analysis  of  dispersion  effects. 

The reanalysis of Quinlan's data by Box and the
discussants of Box's paper (Box, 1988) shows that
the dispersion effects are not particularly
strong. Ignoring the dispersion effects, we can
obtain an error density estimate from Quinlan's
data. This estimate is interesting despite the
violation of the assumption of identically-
distributed error that underlies the replicate-
differences density estimate. Figure 1 shows the
error  density  estimate  based  on  the  assumption  of 
symmetry.  Figure  1  is  a  quantile-quantile  plot  of 
the  estimated  error  density  versus  the  normal 
density.  We  see  that  the  center  of  the  estimated 
density  looks  very  much  like  the  normal,  but  that 
the  tails  of  the  estimated  density  are  somewhat 
thicker  than  the  normal. 

The  tails  of  the  error  in  Quinlan's  data  can  be 
investigated  further  by  means  of  a  half-normal 
probability  plot  of  the  absolute  differences 
between  replicate  measurements.  This  probability 
plot,  which  is  shown  in  Figure  2,  does  not  appear 
to  be  perfectly  straight.  Rather,  this  figure 
suggests,  as  does  Figure  1,  that  the  density  of 
the  absolute  differences  has  a  tail  somewhat 
thicker  than  the  normal. 


[Plot omitted; axis label: Absolute Differences]

Figure 2. Replicate differences from Quinlan's
cable-shrinkage experiment (Box, 1988).

The  existence  of  dispersion  effects  is  a 
reasonable  explanation  for  the  thick  tails 
apparent  in  Figures  1  and  2  since  dispersion 
effects  give  a  set  of  error  realizations  that 
appear to arise from a normal scale mixture if
dependence on the experimental factors is ignored.
Thus, Figures 1 and 2 do not provide any important
clarification of Quinlan's data. Consideration of
Quinlan's data in this paper is intended to link
replicate-differences  density  estimation  with  the 
designs  adopted  for  the  analysis  of  dispersion. 
This  link  raises  the  question  of  how  both  analyses 
can  be  combined. 

Box (1988) mentions that Quinlan's experiment
was run in the "split-plot" mode and that the
experiment may involve two error components only
one of which is reflected in the replicates. One
way to mitigate this problem is to measure all the
design points once, then measure all the design
points a second time, and continue to repeat this
as many times as necessary. If this procedure
were to be followed, then we would likely have to
include a blocking variable, that is, to replace
$\epsilon_{jk}$ in (1) with $\delta_j + \epsilon_{jk}$ to obtain an adequate
model of the measurements. To estimate the error
density, we would first have to estimate the
values of $\delta_j$. Various ways to estimate the $\delta_j$
suggest themselves. In the present context,
estimation of the $\delta_j$ by maximizing seems
interesting since this method can be thought of as
choosing the $\delta_j$ to make the error look as normal
as possible. Differentiation of shows that the
resulting estimate of $\delta_j$ is an M-estimate with
redescending $\psi$ function.

The  most  general  way  to  combine  nonnormality 
and  dispersion  effects  in  an  error  model  is  to 
allow  the  error  density  to  depend  in  some  unknown 
way  on  the  controllable  factors.  Such  a  model 
would  limit  the  pooling  that  could  be  done  in  the 
estimation  of  the  error  densities  and  would  thus 
require  a  very  large  number  of  replicate 
measurements.  One  way  to  limit  the  number  of 
measurements  required  is  to  assume  that  the 
dispersion  effects  only  involve  the  scale  of  the 
error  so  that  after  the  replicate-differences  have 
been  corrected  for  the  scale  effects,  the  error 
density  can  be  estimated  by  pooling  all  the 
corrected differences. Let the error term in (1)
be given by $\sigma_k \epsilon_{jk}$ where the dependence of $\sigma_k$ on
k, the design point, can be modeled by a function
with fewer unknown parameters than K, the number
of design points. We propose to estimate $\sigma_k$ and
then correct the replicate-differences using this
estimate. The estimators of dispersion effects
suggested by Box and the discussants (Box, 1988)
may be appropriate. A robust estimator for the
dispersion effects might be better.

3.  BOOTSTRAP  INFERENCE 

Bootstrap  inference  consists  of  finding  the 
distribution  of  a  statistic  by  computing 
realizations  of  the  statistic  from  independent 
samples  drawn  from  an  estimated  density.  In  this 
paper,  we  focus  on  statistics  for  testing  lack  of 
fit,  which  is  an  important  inference  in  designed 
experiments.  Other  important  inferences  in 
designed  experiments  involve  confidence  intervals 
for  the  differences  between  points  on  the  response 
surface and confidence intervals for the values of
The validity of bootstrap inference depends on
both  the  accuracy  of  the  density  estimate  and  on 
the  characteristics  of  the  statistic,  which  in 
turn,  depend  on  the  experimental  design.  In  this 
section,  we  consider  how  the  choice  of  design  and 
statistic  affect  the  validity  of  bootstrap 
inference  based  on  the  replicate-differences 
density  estimate. 

The  percentiles  of  the  replicate-differences 
density  estimate  have  larger  bias  and  larger 
standard  deviation  in  the  tails  than  at  the  center 
of  the  distribution.  This  suggests  that  this 
density  estimate  will  provide  accurate  percentiles 
for  a  location  estimate  robust  against  stretched- 
tail  error  even  when  the  percentiles  of  the 
density  estimate  itself  seem  inaccurate.  This 
principle  can  be  illustrated  with  the  median  of 
three.  If  the  error  density  and  error 
distribution  are  given  by  p(x)  and  F(x), 
respectively,  then  the  density  of  the  median  of 
three  independent  error  realizations  is  given  by 
6F(1-F)p. The factor 6F(1-F) downweights the
tails  of  p. 
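
The downweighting can be seen numerically. The sketch below takes the standard normal as the error density purely for illustration (the text does not fix a particular $p$), and the function names are hypothetical.

```python
import math


def normal_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def tail_weight(x, cdf=normal_cdf):
    """The factor 6 F(x) (1 - F(x)) applied to p(x)."""
    f = cdf(x)
    return 6.0 * f * (1.0 - f)


def median_of_three_pdf(x, pdf=normal_pdf, cdf=normal_cdf):
    """Density of the median of three iid errors: 6 F (1 - F) p."""
    return tail_weight(x, cdf) * pdf(x)


def integrate(f, lo=-8.0, hi=8.0, steps=4000):
    """Trapezoidal rule, used only to check that the density integrates to one."""
    h = (hi - lo) / steps
    inner = sum(f(lo + i * h) for i in range(1, steps))
    return h * (0.5 * f(lo) + inner + 0.5 * f(hi))
```

The weight equals 1.5 at the median and is below 0.01 three standard deviations out, so errors in the tails of an estimated $p$ have little effect on the implied distribution of the median of three.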

To  provide  some  specific  insight  into  the 
validity  of  the  bootstrap  based  on  the  replicate- 
differences  density  estimate,  we  consider  an 
example  in  which  the  sample  size  is  small  for  the 
purpose of density estimation and the error
distribution is quite asymmetric. We consider 3
replicates at each of 20 design points and error
distributed as $\chi^2$ with 3 degrees of freedom. In
the Hermite function expansion of the error
density (4), we let Q = 10. The mean and standard
deviation of the percentiles obtained in 100
trials are

Error Density    Median of Three

Dividing the standard deviations by 10 to obtain
standard  errors  of  the  means,  we  see  that  the 
density  estimates  are  biased  near  the  center  and 
in  the  tails.  In  the  tails,  both  the  bias  and  the 
standard  deviation  are  smaller  for  the  median  of 
three  than  for  a  single  error  realization.  Thus, 
the  replicate-differences  density  estimate  clearly 
provides  more  accurate  results  for  the  median  of 
three.  These  trials  suggest  that  even  for  the 
median  of  3,  the  design,  3  replicates  at  20  design 
points, and the error density, $\chi^2$ with 3 degrees
of  freedom,  may  not  lead  to  an  adequately  stable 
density  estimate.  In  an  application  of  the 
replicate-differences  density  estimate,  the  effect 
of  the  stability  of  the  density  estimate  on  the 
desired  inferences  should  be  investigated  by  Monte 
Carlo  experiment. 

To test lack of fit, we specialize the $\tau$-tests
discussed by Hampel, et al. (1986). A lack-of-fit
test is a comparison of the fit of the model of
interest with the fit of the most general model
that can be estimated, namely, a location estimate
for each design point based on just the
measurements at that design point. Let $u_1, u_2, \ldots$
be variables that specify the factor levels in
the experimental design, and let $x^T\theta$ be a K-term
polynomial in these factors, $x^T = (1, u_1, u_2, u_1 u_2, \ldots)$.
We can choose such a polynomial with
the property that all the elements of $\theta$ can be
estimated and the property that the model we wish
to test is given by setting $\theta_m = 0$ for elements in
x not in the model. Let $x_k$ be the value of x at
the design point k. Hampel, et al. (1986, p. 346)
offer a test based on

$$\Gamma(\theta) = \sum_{k=1}^{K} \sum_{j=1}^{r_k} \tau\{(y_{jk} - x_k^T\theta)/\sigma\}, \qquad (9)$$

where the function $\tau$ is chosen based on the
desired robustness properties and $\sigma$ is a scale
parameter that must be estimated. Our notation
differs from that in Hampel, et al. (1986) in
obvious ways. Hampel, et al. (1986) propose the
statistic

$$S_n^2 = \frac{2}{K - h} \Big[ \min_\theta \{\Gamma(\theta) \mid \theta_m = 0 \text{ for terms not in model}\} - \min_\theta \{\Gamma(\theta)\} \Big], \qquad (10)$$

where $n = \sum_k r_k$, the total number of measurements,
and h is the number of terms in the model to be
tested. Hampel, et al. (1986) give the asymptotic
distribution of $S_n^2$ and propose that tests be
carried out on the basis of this distribution. As
an alternative, we propose to use bootstrap
samples from the replicate-differences density
estimate to determine whether an observed value of
$S_n^2$ is statistically significant. Work is needed
to determine the situations under which this
proposal has major advantages.
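
The following sketch shows the shape of such a bootstrap test under several simplifying assumptions that are not part of the proposal above. It uses the absolute value in place of the robust function in (9), so that both minimizations reduce to medians; the model tested is a constant mean ($h = 1$); the factor $2/(K - h)$ is assumed for the statistic; and, instead of the replicate-differences density estimate, errors are resampled from the pooled, symmetrized replicate differences divided by $\sqrt{2}$, a crude stand-in. All function names are hypothetical.

```python
import math
import random
import statistics


def scale_from_replicates(groups):
    """Scale factor: median absolute replicate difference / (0.6745 * sqrt(2))."""
    diffs = [abs(a - b) for g in groups for i, a in enumerate(g) for b in g[i + 1:]]
    return statistics.median(diffs) / (0.6745 * math.sqrt(2.0))


def lack_of_fit_stat(groups, h=1):
    """2/(K-h) * [min Gamma under constant-mean model - min Gamma under full model]."""
    sigma = scale_from_replicates(groups)
    pooled = [y for g in groups for y in g]
    centre = statistics.median(pooled)
    gamma_model = sum(abs(y - centre) for y in pooled) / sigma
    # Full model: a separate location (the median) at each design point.
    gamma_full = sum(abs(y - statistics.median(g)) for g in groups for y in g) / sigma
    return 2.0 / (len(groups) - h) * (gamma_model - gamma_full)


def bootstrap_p_value(groups, n_boot=500, seed=0):
    rng = random.Random(seed)
    observed = lack_of_fit_stat(groups)
    # Stand-in error sample: symmetrized replicate differences / sqrt(2).
    errors = []
    for g in groups:
        for i, a in enumerate(g):
            for b in g[i + 1:]:
                d = (a - b) / math.sqrt(2.0)
                errors.extend([d, -d])
    exceed = 0
    for _ in range(n_boot):
        # Under the tested model the mean is constant, so pure error suffices.
        sample = [[rng.choice(errors) for _ in g] for g in groups]
        if lack_of_fit_stat(sample) >= observed:
            exceed += 1
    return observed, (exceed + 1) / (n_boot + 1)
```

A small p-value from `bootstrap_p_value` indicates that the observed statistic is large relative to its bootstrap distribution under the constant-mean model. Substituting samples from a fitted density of the form (4) for `errors` would bring the sketch closer to the proposal.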

Consider the major issues involved in the
choice of the function $\tau$. For lack-of-fit tests
derived from $\tau$-tests, the issue of bounded
influence does not arise and thus the choice of $\tau$
is simplified. For lack-of-fit tests, the
minimization of $\Gamma(\theta)$ under the full model is
simply fitting a separate location estimate to
each design point. Thus, no design point has
higher influence than any other and no choice of $\tau$
provides dependence on the design point. Since no
design point is downweighted, a single design
point might cause rejection of the fit of the
model. This behavior is what is usually expected
of lack-of-fit tests.
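
The point that the full-model minimization splits into one location fit per design point can be sketched in code. The sketch below is illustrative only: it assumes Huber's form for $\tau$ with the conventional tuning constant 1.345, treats $\sigma$ as known, and uses hypothetical function names that do not come from the paper.

```python
import statistics

def huber_tau(r, c=1.345):
    """Huber's tau: quadratic near zero, linear in the tails (assumed choice)."""
    a = abs(r)
    return 0.5 * r * r if a <= c else c * a - 0.5 * c * c

def huber_location(ys, sigma, c=1.345, iters=50):
    """Huber M-estimate of location by iteratively reweighted averaging."""
    mu = statistics.median(ys)
    for _ in range(iters):
        # Weight psi(r)/r: 1 inside [-c, c], c/|r| outside.
        w = []
        for y in ys:
            r = abs((y - mu) / sigma)
            w.append(1.0 if r <= c else c / r)
        mu = sum(wi * y for wi, y in zip(w, ys)) / sum(w)
    return mu

def gamma_full_model(groups, sigma, c=1.345):
    """Minimum of Gamma under the full model: one location per design point."""
    total = 0.0
    for ys in groups:
        mu = huber_location(ys, sigma, c)
        total += sum(huber_tau((y - mu) / sigma, c) for y in ys)
    return total
```

Because the total is a plain sum over design points, each point contributes through its own replicates only, which is why no design point can be downweighted relative to another.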

Another issue is whether to choose a $\tau$ with a
redescending $\psi$ function. On one hand, a
redescending $\psi$ function provides superior
performance when severe outliers are present. On
the other hand, with a redescending $\psi$ function,
the minimum of $\Gamma(\theta)$ under the model might be such
that a design point is completely ignored in both
the estimate of $\theta$ under the model and the value of
$\Gamma(\theta)$ that appears in the statistic $S_n^2$. Thus, the
test might lead to acceptance of the fit of the
model even though all the replicates at one design
point have very large residuals. Belief that all
the measurements at one design point can be
rejected as outliers does not seem reasonable.
One way out of this dilemma is to estimate $\theta$ using
a $\tau$ that has a redescending $\psi$ function but to
avoid a redescending $\psi$ function in the $\Gamma(\theta)$ chosen
for the statistic $S_n^2$. In other words, in testing
lack of fit, the $\tau$-test, which is based on the
same $\tau$ for estimation and testing, might be
generalized to different $\tau$ functions for
estimation and testing.
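
The contrast between the two kinds of $\psi$ function can be made concrete with a small sketch. Huber's $\psi$ is named later in the text; Tukey's biweight is used here only as a familiar example of a redescending $\psi$ and is an assumption, as are the conventional tuning constants.

```python
def huber_psi(r, c=1.345):
    """Monotone psi: clipped at +/- c, so a far outlier keeps a bounded, nonzero pull."""
    return max(-c, min(c, r))

def biweight_psi(r, c=4.685):
    """Tukey's biweight psi: redescends to exactly zero for |r| >= c."""
    if abs(r) >= c:
        return 0.0
    u = r / c
    return r * (1.0 - u * u) ** 2

# A residual of 10 scale units is ignored by the biweight but not by Huber's psi.
far_huber = huber_psi(10.0)
far_biweight = biweight_psi(10.0)
```

With the biweight, every replicate at a badly fitting design point can receive zero weight at once, which is the situation the paragraph above warns about.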

The validity of a bootstrap based on the
replicate-differences density estimate also
depends on the choice of $\tau$. As we have already
noted, the choice of a robust test is important
for validity. Moreover, whether the $\psi$ function is
redescending may have an effect on validity.
Roughly speaking, in the case of stretched-tail
error, the replicate-differences density estimate
tends to have shorter tails than the true density.
Thus, for a redescending $\psi$ function,
the bootstrap samples may have fewer observations
that have no influence than samples from the true
distribution would have. For a non-redescending $\psi$
function such as Huber's, the location of outliers
beyond a certain point makes no difference.

The design of the experiment also has a bearing
on validity. For some designs, the evidence for
lack of fit comes from only one design point. An
example is a centerpoint that has been added to a
two-level factorial design. This is the case in
which an analysis of the experiment based on
normality may be saved by the central limit
theorem; it is also the case in which validity
depends on the percentiles of the distribution of
the location estimate for a single design point.
For other designs, the evidence of lack of fit is
spread over many design points. In this case, the
bootstrap should be valid over a broader range of
error distributions and sample sizes.

4.  APPLICATION 

In this section, we consider a set of kinematic
viscosity measurements made on re-refined oil.
The set of measurements consists of 3 measurements
on each of 65 samples of re-refined oil (Weeks, et
al., 1983). The measurements are of the kinematic
viscosity at 100 °C. In this application, the
actual variability of the samples, which is of
interest, must be distinguished from the
measurement error, which is not normally
distributed. Each oil sample was measured by
three different laboratories. However, since the
same standard measurement method was used by each
laboratory and since the interlaboratory bias was
corrected on the basis of reference sample
measurements, the model given by (1) is plausible.
The non-normality of the measurement error is
manifested in two ways. In the set, 5
measurements differ markedly from the
corresponding measurements by the other two
laboratories. Even without these outliers, a
half-normal probability plot of the differences
between measurements on the same sample shows
evidence of an error density that has a longer
tail than the normal. Consider a statistic for
comparison of the variability of oil samples from
various sources. An appropriate statistic might
be computed from the medians of the three
measurements on each oil sample. The contribution
of the measurement error to this statistic can be
assessed by means of a bootstrap based on the
error density estimate.

An estimate of the error density of the
kinematic viscosity measurements was computed.
Before considering this estimate itself, we
consider two diagnostic quantile-quantile plots, a
plot of the empirical distribution of the absolute
differences between replicates on the same oil
sample versus the distribution of these
differences obtained from the estimated density,
and a plot of the empirical distribution of the
second differences versus their distribution as
obtained from the estimated density. The plot of
the absolute differences in Figure 3 shows that
the estimated density fits the data well except
for the 10 differences that involve the 5
outliers. Similarly, the plot of the second
differences in Figure 4 shows that the estimated
density fits the data well except for the 15
second differences that involve the 5 outliers,
four of which are high, and one low. Clearly, the
estimated density does not account for the extreme
values of the differences. These figures contain
a warning about the interpretation of the
estimated density. Also, these figures suggest
that a better error density estimate might be
obtained by increasing Q so that the error density
estimate can better represent the tails.
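
The counts of 10 and 15 follow from simple bookkeeping with three replicates per sample, sketched below. The sketch assumes that each of the 5 outliers falls in a different oil sample and that a second difference has the form $y_a - 2y_b + y_c$ (one per choice of middle replicate); the function names are hypothetical.

```python
from itertools import combinations

def abs_differences(triple):
    """Absolute differences for the 3 pairs of replicates on one sample."""
    return [abs(triple[i] - triple[j]) for i, j in combinations(range(3), 2)]

def second_differences(triple):
    """Assumed form y_a - 2*y_b + y_c, one value per choice of middle replicate b."""
    y1, y2, y3 = triple
    return [y2 - 2 * y1 + y3, y1 - 2 * y2 + y3, y1 - 2 * y3 + y2]

def pairs_involving(index):
    """Pairs of replicate positions that include the given position (0, 1, or 2)."""
    return [pair for pair in combinations(range(3), 2) if index in pair]

n_outliers = 5
# An outlying replicate enters 2 of the 3 pairwise differences ...
affected_first = n_outliers * len(pairs_involving(0))
# ... and all 3 second differences of its sample.
affected_second = n_outliers * len(second_differences((0.0, 0.0, 1.0)))
```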


Figure 3. First differences between replicate
kinematic viscosity measurements, empirical versus
estimated distribution.

Figure 4. Second differences between replicate
kinematic viscosity measurements, empirical versus
estimated distribution.

Figure 5 shows a quantile-quantile plot of the
estimated distribution versus the normal
distribution. The error density appears to be
negatively skewed, but this conclusion must be
tempered by the results in Figures 3 and 4. One
way to check the effect of the outliers is to
remove them from the data set and re-estimate the
error density. The result of this is shown in
Figure 6. Both Figures 5 and 6 show the same
basic shape for the error density, a negative
skewness. Thus, we conclude that the error
density estimator largely ignored the 5 outliers.
This behavior can be explained by the factor
$\exp(-x^2/2)$ in the Hermite functions that appear in
(6) and (8). These viscosity measurements suggest
the appropriateness of our error density estimator
for bootstrap inference for robust estimates.

Figure 5. Error density estimated from kinematic
viscosity measurements.

Figure 6. Error density estimated from kinematic
viscosity measurements remaining after outlier
removal.
INTRODUCTION

Almost  all  statistics  and  econometrics  texts  contain 
strong  admonitions  against  sequential  estimation  (or 
"data  mining").  These  admonitions  are  as  effective  as 
those  against  teen-age  sex  and  drug  abuse.  Applied 
econometricians  ignore  the  textbook  warnings  and  use 
sequential  strategies  because  they  believe  that  they 
yield  better  estimates.  In  spite  of  considerable  efforts, 
theoretical  statisticians  have  been  unable  to  analyze  the 
sampling  properties  of  these  strategies  under  realistic 
conditions (see Judge and Bock (1978) and Judge
(1984))¹. This study solves this problem by using the
bootstrap (see Efron (1982) and Efron and Gong (1982))
to  compute  the  sampling  distribution  of  different 
estimation  strategies. 

This  paper  examines  the  sampling  properties  of 
simple  multiple  regression  estimation  strategies  based 
on  variable  and  outlier  deletion.  With  only  small 
deviations  from  a  model  with  orthogonal  regressors  and 
normally  distributed  errors  there  are  substantial  biases 
in the standard errors and t-statistics reported at the
last  stage  of  these  simple  strategies.  Since  the  bootstrap 
is  only  asymptotically  valid,  the  results  presented  here 
are  based  on  Monte  Carlo  repetitions  from  known  error 
distributions  to  eliminate  the  confounding  effects  of 
possible  small  sample  biases.  However,  for  all  of  the 
designs  considered  in  this  paper,  the  small  sample  biases 
in  the  nonparametric  bootstrap  are  negligible.  The  key 
conclusion  of  this  work  is  the  necessity  of  completely 
specifying  the  estimation  strategy  and  then 
bootstrapping  it  to  get  consistent  estimates  of  the 
sampling  distribution. 

The  bootstrap  technique  works  by  generating 
artificial  data  samples  and  computing  the  estimator  for 
each sample. This technique has been used to derive
small sample properties of estimators for autoregressive
linear models by Freedman and Peters (1984) and for
Nested Logit models by Brownstone and Small (1988).
Independently, and more recently, Kipnis (1987) and
Veall (1987) have used the bootstrap to examine the
effects of various estimation strategies in linear
regression models. Kipnis looks at the strategy of
choosing the subset of variables to maximize R² in a
model with orthogonal regressors, but he does not
report the properties of individual coefficient
estimators. Veall considers these properties for a
stepwise  regression  strategy  applied  to  an  empirical 
example.  Although  his  results  are  qualitatively  similar 
to  those  in  this  study,  it  is  impossible  to  disentangle  the 
effects  of  possible  model  misspecification  from  the  biases 
caused  by  the  estimation  strategy. 

Since  most  applied  econometricians  use  some 
sequential  estimation  strategy  but  only  report  the 
biased  t-statistics  and  standard  errors  from  the  last 
stage,  this  study  concentrates  on  examining  the  size  of 
these  biases  for  a  number  of  known  models.  The  design 
of  the  experiments  concentrates  on  isolating  the  effects 
of  multicollinearity  among  the  regressors.  The  biases 
reported  here  are  caused  solely  by  the  use  of  sequential 
estimation  strategies.  This  study  is  not  designed  to 
explore  exactly  when  the  biases  will  be  large,  but  rather 
to  show  how  pervasive  large  biases  are  and  suggest  a 
methodology  for  removing  them.  In  particular,  there  are 
always  large  negative  biases  in  the  standard  errors 
estimated  from  the  last  round  of  even  simple  sequential 
procedures.  With  moderate  collinearity,  these  biases  are 
frequently  greater  than  100  per  cent. 

Although  it  would  be  interesting  to  further  isolate 
the  causes  of  the  biases  from  sequential  estimation, 
there  is  clearly  large  potential  gain  from  designing 
better estimation strategies. The bootstrap methods
described  in  this  study  could  then  be  used  to  generate 
consistent  estimates  of  the  sampling  distribution  of 
these  strategies. 

EXPERIMENTAL  DESIGN 

The  experiments  are  designed  to  investigate  the 
sampling  properties  of  two  estimation  strategies, 
Ordinary  Least  Squares  (OLS)  and  Sequential  OLS, 
with  and  without  deletion  of  outliers  and  influential 
observations.  The  Sequential  OLS  (abbreviated  by 
SEQ)  procedure  used  in  this  study  consists  of  first 
estimating the full model by OLS, deleting all variables
(except  the  first  two)  with  absolute  T-statistics  less 
than  2,  and  finally  estimating  the  restricted  model  by 
OLS.  The  T-statistics  for  this  procedure  are  calculated 
from  the  usual  OLS  formulas  at  the  second  stage. 
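
The two-stage SEQ procedure just described can be sketched in a few lines. This is a minimal illustration, not the code used for the study: it assumes the two always-retained variables occupy the first two columns of X, solves the normal equations directly in pure Python, and uses hypothetical function names.

```python
import math

def ols(X, y):
    """OLS via the normal equations; returns (coefficients, t-statistics)."""
    n, p = len(X), len(X[0])
    # Augmented matrix [X'X | I], inverted by Gauss-Jordan elimination.
    A = [[sum(X[i][a] * X[i][b] for i in range(n)) for b in range(p)]
         + [1.0 if a == b else 0.0 for b in range(p)] for a in range(p)]
    for col in range(p):
        piv = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        d = A[col][col]
        A[col] = [v / d for v in A[col]]
        for r in range(p):
            if r != col:
                f = A[r][col]
                A[r] = [vr - f * vc for vr, vc in zip(A[r], A[col])]
    inv = [row[p:] for row in A]
    Xty = [sum(X[i][a] * y[i] for i in range(n)) for a in range(p)]
    beta = [sum(inv[a][b] * Xty[b] for b in range(p)) for a in range(p)]
    resid = [y[i] - sum(X[i][a] * beta[a] for a in range(p)) for i in range(n)]
    s2 = sum(e * e for e in resid) / (n - p)
    t = [beta[a] / math.sqrt(s2 * inv[a][a]) for a in range(p)]
    return beta, t

def seq_ols(X, y, always_keep=2, cutoff=2.0):
    """Stage 1: full OLS. Stage 2: refit after dropping variables with |t| < cutoff."""
    _, t_full = ols(X, y)
    keep = [j for j in range(len(X[0]))
            if j < always_keep or abs(t_full[j]) >= cutoff]
    X_r = [[row[j] for j in keep] for row in X]
    beta, t = ols(X_r, y)
    return keep, beta, t
```

The t-statistics returned by `seq_ols` are the ones an applied researcher would report; they take the second-stage model as given and ignore the selection step.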

There  are  7  regressors  and  100  observations  in  each 
data  set.  The  regressors  and  true  parameter  values  are 
initially generated as independent draws from a unit
Normal distribution. Each set of independent regressors
is then transformed into 5 increasingly collinear data
sets. The most collinear data had condition numbers² of
approximately  135.  These  data  are  examples  of  strongly 
collinear  data,  but  such  data  are  quite  commonly 
encountered  in  applied  econometric  work.  The  highest 
bivariate  correlation  between  any  two  regressors  in  any 
of  the  experiments  is  0.9. 

For  each  of  the  four  experiments,  100  independent 
draws  of  the  regressors  and  true  parameters  were  made. 
For  each  of  these  draws,  and  each  of  the  5  collinear  data 
sets  based  on  them,  300  independent  "dependent 
variables"  were  generated  according  to 

Y = Xβ + ε,

where X and β are fixed, and the ε are independent unit
Normal random variables or, for two of the
experiments, independent unit Normal contaminated
with 10% independent draws from a Normal
distribution with mean 0 and variance 100³. The
sampling distribution of each of the estimation
strategies is then estimated from the sample of
estimates over the 300 bootstrap repetitions⁴. All of the
results  presented  in  this  study  pertain  to  the  estimation 
of  the  coefficient  of  the  second  of  the  two  variables 
which  were  always  kept  in  the  regressions. 
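
The data-generating step can be sketched as follows. The sketch assumes exactly 10% of the error positions are chosen at random for contamination, as the accompanying footnote describes; the function names and the `rng` argument are illustrative, not taken from the study.

```python
import random

def draw_errors(n, contaminated=False, frac=0.10, big_sd=10.0, rng=random):
    """n unit-Normal errors; optionally a random 10% are replaced by N(0, 100) draws."""
    eps = [rng.gauss(0.0, 1.0) for _ in range(n)]
    if contaminated:
        for i in rng.sample(range(n), round(frac * n)):
            eps[i] = rng.gauss(0.0, big_sd)  # variance 100 means standard deviation 10
    return eps

def make_y(X, beta, eps):
    """Dependent variable Y = X beta + eps for fixed X and beta."""
    return [sum(x * b for x, b in zip(row, beta)) + e for row, e in zip(X, eps)]
```

Repeating `make_y` with fresh error draws for the same X and β gives the repetitions over which the sampling distribution of a strategy is estimated.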

Each  experiment  considers  the  SEQ  and  OLS 
estimation  strategies  for  100  independent  draws  of  the 
true  parameter  values  and  6  increasingly  collinear 
regressor  matrices.  In  two  of  the  experiments,  outliers 
and  influential  observations  were  deleted  before  the 
estimation  strategies  were  calculated.  Outliers  are  those 
observations with standardized residuals⁵ greater than 2.
Influential  observations  are  those  with  "hat"  values 
greater  than  0.14.  These  measures,  and  choice  of  cutoff 
values,  are  fully  described  in  Belsley,  Kuh,  and  Welsch 
(1980). 
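
The deletion rule can be sketched as a simple mask over the observations. Two points here are assumptions rather than statements from the text: the residual cutoff is applied to absolute values, and the 0.14 hat-value cutoff is read as the 2p/n rule of thumb with p = 7 regressors and n = 100 observations, which it matches numerically.

```python
def deletion_mask(std_resid, hat, resid_cut=2.0, hat_cut=0.14):
    """True where an observation would be deleted as an outlier or as influential."""
    return [abs(r) > resid_cut or h > hat_cut for r, h in zip(std_resid, hat)]

# 2p/n with p = 7 and n = 100 reproduces the 0.14 cutoff quoted above.
hat_cut_rule = 2 * 7 / 100
```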


The  four  experiments  used  in  this  study  are  chosen 
to  investigate  commonly  used  variable  and  outlier 
deletion  strategies  across  a  wide  range  of  realistic  model 
settings. The experiments are:

RUN1: OLS and Sequential OLS (SEQ) are
compared where the dependent variable is
uncontaminated (i.e., the ε's are all draws from a
unit Normal distribution).

RUN2: OLS and SEQ are compared where the
dependent variable is contaminated (i.e., 10% of the
ε's are drawn from a Normal distribution with
variance equal to 100).

RUN3: After first removing outliers and influential
observations, SEQ and OLS are compared where the
dependent variable is uncontaminated.

RUN4: After first removing outliers and influential
observations, SEQ and OLS are compared where the
dependent variable is contaminated.

100 different basic models (initial X matrix and β
values) are used for each experiment since there is no a
priori reason to expect convergence to anything over
these repetitions. The purpose of these repetitions is to
investigate the behavior of the estimation strategies
across different models and to ensure that the results are
not artifacts of some peculiar X or β values. Finally,
since  the  matrices  for  transforming  the  basic 
independent  X  matrices  into  collinear  regressors  were 
fixed  across  all  of  the  runs,  the  repetitions  also  induce 
small  variations  in  collinearity  around  the  experimental 
design  values. 

RESULTS 

The results of the four experimental runs are
presented as percentiles across the 100 basic data
repetitions in Tables 1-4. The numeric suffixes on the
row labels refer to the degree of collinearity in the X
matrix. The suffix 1 refers to the basic independent
data set, and higher suffixes correspond to the
increasingly collinear transforms of these data. The rows
labelled "COND" give the condition number for the X
matrices.

TABLE 1: RUN1 RESULTS
Uncontaminated Errors, No Outlier Deletion

Percentiles         5       25       50       75       95

COND1            1.36     1.46     1.52     1.58     1.66
EFF1            -4.16    -1.26    -0.19     0.01     1.15
BIOLS1          -5.57    -1.62     1.80     3.84     6.56
BISEQ1          -4.69    -0.39     2.14     4.39     7.61

COND2            5.16     6.27     6.72     7.25     8.11
EFF2           -44.96   -16.56    -6.31    -0.69     5.45
BIOLS2          -5.76    -1.41     1.04     3.49     6.34
BISEQ2          -3.30     2.86     6.44    15.03    33.20

COND3           10.34    12.66    13.65    14.85    17.00
EFF3           -69.69   -39.66   -20.66    -5.02    12.65
BIOLS3          -5.99    -1.81     1.45     3.63     7.69
BISEQ3          -2.12     9.30    23.45    38.09    72.92

COND4           29.35    36.79    40.72    44.29    51.00
EFF4          -120.04   -66.68   -30.98    21.96    89.89
BIOLS4          -7.20    -1.17     0.77     3.76     8.13
BISEQ4           5.26    38.09    63.30    86.85   117.39

COND5           58.17    73.31    81.51    88.15   102.25
EFF5          -132.83   -66.24     7.60    60.75   111.65
BIOLS5          -7.29    -2.36     0.28     2.73     6.37
BISEQ5          11.37    32.72    63.39    96.11   128.30

COND6           96.78   121.72   135.88   146.74   170.45
EFF6          -125.98   -41.93    27.43    75.78   137.30
BIOLS6          -5.38    -1.79     0.25     3.18     7.54
BISEQ6          11.59    29.46    59.25    94.29   141.82

The  other  three  rows  in  each  group  give  the 
properties  of  the  estimated  second  regressor  coefficient 
(recall  that  the  first  two  regressors  were  always 
included). The row labelled "EFF" gives the percentage
improvement in the mean square error (MSE) of
estimation of SEQ versus OLS: positive values imply that
SEQ  is  a  better  estimator.  Note  that  the  MSE  here  is 
measured  relative  to  the  true  parameter  value  used  to 
generate  the  dependent  variables. 

TABLE 2: RUN2 RESULTS
Contaminated Errors, No Outlier Deletion

Percentiles         5       25       50       75       95

COND1            1.39     1.48     1.53     1.61     1.67
EFF1            -8.35    -3.92    -2.35    -0.03     3.74
BIOLS1           4.22     8.22    10.92    14.01    19.10
BISEQ1           4.44     9.21    12.09    14.31    19.33

COND2            4.60     6.19     6.91     7.41     8.29
EFF2           -52.82   -28.19   -13.30    -2.16    22.15
BIOLS2          -0.08     4.83     8.23    12.11    17.77
BISEQ2           5.75    15.39    24.05    35.97    51.11

COND3            9.15    12.46    14.17    15.19    17.17
EFF3           -80.17   -29.88     1.44    23.25    64.68
BIOLS3          -2.42     5.13     9.57    12.99    17.25
BISEQ3          14.96    27.53    38.16    51.54    66.36

COND4           27.48    37.45    41.24    45.59    52.45
EFF4           -57.50    24.80    57.65    96.01   134.18
BIOLS4          -0.31     5.79     9.00    13.25    19.81
BISEQ4          13.37    24.22    41.12    60.09   109.46

COND5           54.10    74.58    82.41    91.83   105.18
EFF5           -65.79    50.05    89.00   121.69   148.44
BIOLS5          -2.89     6.20    10.44    13.22    18.97
BISEQ5           8.34    23.00    36.97    59.85   120.41

COND6           89.81   124.32   137.34   153.21   175.36
EFF6           -43.33    56.68    95.34   128.77   151.01
BIOLS6          -0.28     6.01     9.75    13.25    20.23
BISEQ6          14.80    22.82    38.61    60.75   108.48

The remaining two rows (prefixes BIOLS and
BISEQ) in the Tables give the percentage bias in the
T-statistics for the two estimation strategies. These
biases are computed by comparing the average of the
standard OLS T-statistics from the last stage regression
over the 300 bootstrap repetitions with:


where b_i denotes the estimate of the second element of
β at the ith bootstrap repetition and b̄ is the sample mean
of the b_i. Since Tables 1-4 are based on a Monte Carlo
study with the error vectors drawn from their known
true distributions, T converges to the true T-statistic as
the number of bootstrap repetitions gets large.
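
A comparison of this kind can be sketched as below. The reference value used here, the mean of the b_i divided by their sample standard deviation, is an assumption chosen to be consistent with the definitions of b_i and b̄ above; it is not confirmed as the exact expression used in the study, and the function names are hypothetical.

```python
import statistics

def reference_t(b):
    """Assumed reference value: mean of the repetition estimates over their standard deviation."""
    return statistics.mean(b) / statistics.stdev(b)

def percent_bias(reported_t, b):
    """Percentage by which the average reported last-stage t-statistic exceeds the reference."""
    T = reference_t(b)
    return 100.0 * (statistics.mean(reported_t) - T) / T
```

Under this reading, a positive value means the reported last-stage T-statistics overstate the precision of the estimator on average.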

TABLE 3: RUN3 RESULTS
Uncontaminated Errors With Outlier Deletion

Percentiles         5       25       50       75       95

COND1            1.36     1.46     1.52     1.59     1.70
EFF1            -4.49    -1.30    -0.28     0.01     1.10
BIOLS1          12.41    17.07    20.83    23.80    29.18
BISEQ1          13.49    17.57    21.54    24.51    29.77

COND2            5.00     6.18     6.92     7.51     8.44
EFF2           -34.37   -17.22    -4.16    -0.72     3.29
BIOLS2          12.71    17.23    19.60    23.15    26.02
BISEQ2          12.52    21.60    25.78    34.27    47.85

COND3            9.98    12.65    14.09    15.48    17.29
EFF3           -62.82   -31.62   -13.52    -3.43    23.82
BIOLS3          13.71    16.80    21.26    24.35    27.51
BISEQ3          17.54    26.95    41.97    56.69   103.73

COND4           28.38    37.05    41.91    46.80    52.97
EFF4          -109.17   -48.01   -16.94    24.76    63.45
BIOLS4          12.20    17.39    21.52    24.67    28.99
BISEQ4          34.41    53.72    77.91   101.03   150.48

COND5           56.78    74.05    84.27    93.65   106.37
EFF5          -105.51   -24.69    15.17    55.82    88.88
BIOLS5          14.61    18.04    20.36    22.94    26.94
BISEQ5          28.99    53.99    79.04    99.73   162.33

COND6           94.66   123.43   140.65   156.13   177.41
EFF6          -104.89   -15.20    42.87    76.36    96.63
BIOLS6          12.88    17.48    21.56    24.10    28.15
BISEQ6          31.22    51.77    77.37    95.24   150.99

If the error vectors are drawn from the empirical
distribution of the residuals from a regression using all
of the regressors, the resulting T would be Efron's
nonparametric bootstrap estimator. Since all of the
models considered here satisfy the Gauss-Markov
assumptions,  Efron’s  (1982)  results  show  that  T 
converges  to  an  unbiased  test  statistic  for  the 
hypothesis  that  Plim  U  =  0  as  the  number  of  bootstrap 
repetitions  gets  large.  Note  that  these  estimators  do  not 
require  knowledge  of  the  true  model  so  that  they  can  be 
applied  in  real  situations.  The  small  sample  accuracy  of 
this  bootstrap  estimator  was  checked  by  rerunning  all  of 
the  experiments  with  the  error  vectors  drawn  from  their 
empirical  distributions.  In  all  cases  the  results  are 
almost  identical  to  the  Monte  Carlo  results  in  Tables 
1-4,  thus  justifying  the  use  of  the  nonparametric 
bootstrap  at  least  for  these  experimental  designs. 

TABLE 4: RUN4 RESULTS
Contaminated Errors With Outlier Deletion

Percentiles         5       25       50       75       95

COND1            1.38     1.46     1.50     1.61     1.72
EFF1            -7.69    -3.31    -1.13    -0.02     2.29
BIOLS1           2.01     7.78    10.61    14.16    20.15
BISEQ1           3.52     8.73    11.88    14.93    19.65

COND2            5.52     6.13     6.77     7.46     8.21
EFF2           -42.97   -22.55   -10.90    -3.44    16.02
BIOLS2           3.87     7.75    11.36    15.01    19.46
BISEQ2          12.29    17.19    22.82    28.56    47.48

COND3           10.94    12.39    13.76    15.46    16.99
EFF3           -79.91   -42.80   -18.56    -3.41    32.56
BIOLS3          -0.20     7.59    11.50    14.43    20.16
BISEQ3          12.41    25.47    35.28    54.37    76.82

COND4           31.35    36.65    41.38    45.92    51.32
EFF4          -101.95   -39.32    -2.31    40.63    79.29
BIOLS4          -1.20     7.87    11.28    13.74    19.80
BISEQ4          23.64    42.65    64.45    86.20   132.02

COND5           62.35    73.44    82.88    91.75   103.33
EFF5           -81.06   -20.41    36.51    79.96   120.77
BIOLS5           0.63     6.50    10.41    13.19    18.13
BISEQ5          18.83    35.70    60.52    84.14   135.69

COND6          103.80   122.40   138.21   152.95   172.52
EFF6          -103.36     2.11    50.34    91.57   135.16
BIOLS6          -0.61     6.89    10.24    13.56    20.03
BISEQ6          12.95    29.32    54.59    87.97   124.89

Table 1 gives the results of the first experiment,
RUN1, with uncontaminated errors and no outlier
deletion.  As  textbook  theory  predicts,  there  are  no 
biases  or  differences  between  OLS  and  SEQ  if  the 
regressors  are  independent  (corresponding  to  the  suffix 
"1"  in  the  tables).  The  OLS  T-statistics  are  also 
unbiased  since  they  are  Uniform  Minimum  Variance 
Unbiased  estimators  in  this  situation.  However,  with 
even  mild  collinearity,  there  are  substantial  efficiency 
differences  between  OLS  and  SEQ.  More  striking  are 
the  increasingly  large  positive  biases  in  the  T-statistics 
for  the  SEQ  strategy:  these  biases  average  60  per  cent 
and  frequently  exceed  100  per  cent.  With 
multicollinearity  it  is  possible  to  considerably  improve 
estimation  efficiency  using  the  SEQ  strategy,  but  the 
resulting  T-statistics  will  certainly  be  overestimates. 
FIGURE 1: BIAS IN SEQ T-STATISTIC, RUN1

FIGURE 2: BIAS IN SEQ T-STATISTIC, RUN3

Notes for Figures 1 and 2:

These  figures  show  box  plots  of  the  percentage  bias 
in  the  T-statistics  for  the  SEQ  estimation  strategy.  Box 
plots,  originally  designed  by  Tukey,  are  common  tools 
in  exploratory  data  analysis.  They  are  described  in 
textbooks  like  Kitchens  (1987).  The  upper  part  of  the 
box  is  at  the  75th  percentile,  the  line  in  the  middle  of 
the  box  is  at  the  median  (50th  percentile)  and  the  lower 
part  of  the  box  is  at  the  25th  percentile.  The  upper 
whisker  is  at  the  "upper  adjacent  value,"  which  is  the 
closest observation to the 75th percentile + 1.5 × the
Interquartile Range (the 75th - 25th percentile). The
open circles denote outliers, which are any observations
past the adjacent values. If the data followed a Normal
distribution, then we would only expect to see 0.7 of
these outliers per box in any of the plots.
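
The figure of 0.7 outliers per box can be checked with a short calculation. The sketch assumes each box summarizes the 100 basic data repetitions and counts points beyond either fence (quartile ± 1.5 × the interquartile range) under a standard Normal distribution.

```python
from statistics import NormalDist

z = NormalDist()                           # standard Normal
q3 = z.inv_cdf(0.75)                       # 75th percentile
iqr = 2 * q3                               # by symmetry, the 25th percentile is -q3
upper_fence = q3 + 1.5 * iqr
p_outside = 2 * (1 - z.cdf(upper_fence))   # both tails
expected_per_box = 100 * p_outside         # assumed 100 observations per box
```

The probability of falling outside the fences is a little under 0.7 per cent, so about 0.7 points per 100 observations would be flagged under normality.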


RUN2 considers the same estimation strategies in a
case where 10 per cent of the errors are contaminated.
The results, shown in Table 2, are similar to those in
RUN1. One difference is that now some of the T-statistics
for OLS for collinear regressors are positively biased,
although  these  biases  are  much  smaller  than  the 
positive  biases  in  the  T-statistics  for  SEQ.  In  addition 
there  are  now  clearer  efficiency  gains  to  using  the  SEQ 
strategy  as  collinearity  increases. 

RUN3 has the same data generation process as
RUN1, but the OLS and SEQ strategies are modified by
first removing outlying and/or influential observations.
As expected from the properties of the data generating
process, approximately 10 percent of the observations
are removed in each replication. The efficiency
comparisons between SEQ and OLS are similar to those
for RUN1. Note that now the T-statistics for both OLS
and  SEQ  are  biased  even  for  orthogonal  regressors.  The 
magnitude  of  this  bias  increases  with  collinearity  for 
SEQ,  but  remains  constant  for  OLS. 

RUN4 compares the same estimation strategies as
RUN3 for data generated with contaminated errors.
Since there are now some serious outliers to be removed,
the estimators should perform better than in RUN3.
Although  the  biases  in  the  T-statistics  are  lower  than  in 
RUN3  for  both  estimation  strategies,  the  biases  for  the 
SEQ  strategy  are  still  very  large  for  highly  collinear 
regressors. 

One  common  feature  of  all  the  results  presented  in 
Tables 1-4 is the large variation in almost all the
measures across the different data designs. Figures 1
and 2 graphically show the bias in the T-statistics for
the SEQ strategy in RUN1 and RUN2. The large
magnitudes  of  the  efficiency  differences  and  biases 
clearly  suggest  that  there  is  large  potential  gain  from 
developing  better  estimation  strategies. 


CONCLUSIONS 

The  simulations  show  the  dangers  in  using  the 
results  of  common  estimation  strategies  for  hypothesis 
testing.  Although  this  study  only  considers  simple  linear 
regression  models,  I  expect  the  qualitative  conclusions 
to  hold  for  more  complex  econometric  models.  The 
methodology used here can easily be applied to
analyzing any estimation strategy for any well-specified
model⁶.

This  study  also  demonstrates  the  feasibility  of  using 
the  bootstrap  to  generate  consistent  estimates  of  the 
sampling  distribution  of  estimation  strategies  for 
multiple  regression  models.  The  large  differences  in 
estimation  efficiency  between  the  OLS  and  SEQ 
strategy  show  that  there  is  large  potential  gain  from 
designing  better  strategies.  Even  if  one  only  uses  OLS, 
there  are  still  substantial  biases  in  the  T-statistics  when 
there  are  outliers  and/or  influential  observations. 

Although  it  would  be  interesting  to  explore  the 
conditions  where  SEQ  works  well  in  these  experiments, 
theoretical work (Belsley et al. (1980) and Judge and
Bock (1978)) suggests that these conditions will depend
on  unknown  parameters.  Bootstrapping  allows 
consistent  estimation  of  the  sampling  distribution  of 
any  sequential  procedure,  which  allows  comparisons  to 
be  made  for  each  model  and  data  set. 
course, they bear no responsibility for any remaining
flaws. 

¹Although some of the simplest experiments reported
here could be analyzed using analytic techniques, the
experiments involving outlier and influential
observation deletion cannot.

²The condition number is defined to be the ratio of the
largest and the smallest eigenvalue of the moment
matrix (X'X) of the independent variables. See Belsley,
Kuh, and Welsch (1980) for further information and
justification for this as a measure of multicollinearity.
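
As a worked illustration of this definition, consider only two standardized regressors with correlation r, so that the moment matrix is proportional to [[1, r], [r, 1]]. This two-regressor case is an assumption made for the example; the experiments use 7 regressors. The eigenvalues are 1 + r and 1 - r, and the sketch below computes their ratio for a general symmetric 2 × 2 matrix.

```python
import math

def condition_number_2x2(a, b, d):
    """Ratio of largest to smallest eigenvalue of the symmetric matrix [[a, b], [b, d]]."""
    mean = (a + d) / 2.0
    radius = math.sqrt(((a - d) / 2.0) ** 2 + b * b)
    return (mean + radius) / (mean - radius)

# Correlation 0.9 between two standardized regressors: (1 + 0.9) / (1 - 0.9).
cond_r09 = condition_number_2x2(1.0, 0.9, 1.0)
```

For r = 0.9 the ratio is about 19, which shows how quickly the measure grows with correlation even in the two-regressor case.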

³A randomly chosen 10% of the ε's are independent
normal random variables with mean 0 and variance 100,
and the rest are independent unit normally distributed.
The 10 per cent figure is based on the fundamental
"law" of survey statistics which states that 10 per cent
of any data set is garbage.

⁴In a number of cases, 600 bootstrap repetitions were
run as a convergence check. There were no significant
changes in the results after 200 repetitions.

⁵As suggested in Belsley, Kuh, and Welsch (1980), these
standardized residuals were computed by first excluding
the observation in question from the standard error
calculations.

⁶Computational costs may become prohibitive in more
complex settings. All simulations for this study were
performed on an 8 MHz PC/AT clone with a total
running time of 120 hours.
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ABSTRACT 

This paper presents a computational procedure
and the numerical results for studying the
effects of outliers or other anomalous data on
maximum likelihood estimates. This procedure is
based on a first-order approximation relying on
the implicit function theorem. The numerical
results of this paper are given for a
multivariate signal-plus-noise problem with
independent non-identically distributed noise
terms. These numerical studies will illustrate
the procedure.

1.  INTRODUCTION

This paper presents an efficient method of
determining the sensitivity of maximum likelihood
estimates (MLEs) to the data used in calculating
the estimates. This method is much more efficient
than a standard simulation that would involve
several recomputations of the MLE, and it is useful
in predicting the effect of outliers or anomalous
data on the estimate.

Maximum likelihood estimation is widely used
in statistical analysis. It is found in
estimating the instrumentation error for a
guidance system or a navigation system, in the
geodetic parameter errors for an earth model,
and in orbit determination for satellites. In
addition, the MLE is also utilized in
macroeconomic modeling, biometrics problems, and
education. Because of sophisticated equipment
and the complications of the real world, the
dimension of an MLE problem can be very large;
it is very common to have over a hundred
parameters in a single case. Since the MLE has
no closed-form solution, it is very costly to
find an MLE. To find many MLEs for data
sensitivity studies is even harder. Therefore,
it is worth the effort to develop a method which
can approximate MLEs in a quick and accurate
fashion. This method is different from the
sampling techniques discussed in Iman [1980].

Section 2 will present the approximation
method named the First-Order IFAP, or Implicit
Function Approximation. The general IFAP theory
can be found in Spall (1986). Section 3
presents numerical studies on the signal-plus-noise
problems, and Section 4 is a brief
conclusion.

2.  AN  APPLICATION  OF  THE  IFT

The First-Order IFAP contains the first two
terms of the Taylor expansion around the
existing MLE, obtained using the implicit function
theorem (IFT). Since only the "first-order" version
will be discussed, this hyphenated word will be
omitted. The nonlinear estimation for nonlocal
sensitivity can be found in Kalaba [1986]. As
discussed in Spall [1985], IFAP pertains to an
approximation framework of the form handled by a
parameter estimator; that is, from data
$x_1, x_2, \ldots, x_n$, with $x_i \sim N(\mu, \Sigma + P_i)$,
IFAP can be used to gain insight into the properties
of the maximum likelihood (ML) estimate, $\hat\theta$, of the
vector of unique and relevant parameters, $\theta$, in
$\mu$ and $\Sigma$. This study demonstrates how the
current software can be used to study the
influence of anomalies or outliers within the
set of $x_i$'s on the estimate $\hat\theta$.

Assume that $\hat\theta$ is found as the root of the
score equation, i.e.,

$$\left.\frac{\partial L}{\partial \theta}\right|_{\theta = \hat\theta} = 0 \qquad (1)$$

where L represents the log-likelihood function.
Since (1) involves a term like
$\partial L/\partial \Sigma = 0$ and since it may be that some
$\Sigma \not\geq 0$ satisfies $\partial L/\partial \Sigma = 0$, IFAP is not
necessarily working with a constrained (positive
semidefinite) estimate of $\Sigma$. A further restriction of
the current IFAP formulation is that all $P_i$'s
are assumed to exist (i.e., $(P_i^{-1})^{-1}$ exists).

We believe that a modification of IFAP to accommodate
either the square-root formulation (i.e.,
the procedure for ensuring $\Sigma \geq 0$) or the so-called
information formulation, which relies on
$P_i^{-1}$ instead of the nonexistent $P_i$, would be
fairly straightforward.

Given an observed set of data,
$x^* \equiv (x_1^{*T}, x_2^{*T}, \ldots, x_n^{*T})^T$ and $P_1, P_2, \ldots, P_n$,
the present software computes quantities related
to the first-order expansion

$$\tilde\theta(x \mid x^*) = \hat\theta(x^*) + T_1 (x - x^*)$$

where $T_1 = [\partial \hat\theta / \partial x^T]_{x = x^*}$ is computed using the
implicit function theorem and $\hat\theta^* = \hat\theta(x^*)$. The
various quantities computed from $\hat\theta^*$ and $T_1$
include several unit-free, normalized measures
of sensitivity which will be discussed in
greater detail in the next section.
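
The first-order expansion above can be sketched numerically. The example below is only an illustration under stated assumptions, not the IFAP software itself: it uses a scalar signal-plus-noise model, x_i ~ N(mu, v + p_i) with known p_i, so that theta = (mu, v). The function names (`score`, `fit_mle`, `sensitivity`), the simulated data, and the perturbation size are all hypothetical. Differentiating the score equation with respect to the data gives T = -(dS/dtheta)^{-1} (dS/dx), which is the implicit function theorem step.

```python
import numpy as np

def score(theta, x, p):
    """Score vector dL/dtheta for x_i ~ N(mu, v + p_i), theta = (mu, v)."""
    mu, v = theta
    w = v + p                      # total variance of each observation
    r = x - mu                     # residuals
    return np.array([np.sum(r / w),
                     0.5 * np.sum(r**2 / w**2 - 1.0 / w)])

def fit_mle(x, p, tol=1e-12, max_iter=500):
    """Unconstrained Fisher scoring for (mu, v)."""
    theta = np.array([x.mean(), x.var()])
    for _ in range(max_iter):
        w = theta[1] + p
        # The Fisher information is diagonal for this model.
        info = np.array([np.sum(1.0 / w), 0.5 * np.sum(1.0 / w**2)])
        step = score(theta, x, p) / info
        theta = theta + step
        if np.max(np.abs(step)) < tol:
            break
    return theta

def sensitivity(theta, x, p):
    """T = d(theta_hat)/dx^T from the implicit function theorem."""
    mu, v = theta
    w = v + p
    r = x - mu
    # Jacobian of the score with respect to theta (Hessian of L).
    H = np.array([[-np.sum(1.0 / w), -np.sum(r / w**2)],
                  [-np.sum(r / w**2), np.sum(0.5 / w**2 - r**2 / w**3)]])
    # Jacobian of the score with respect to the data, shape (2, n).
    S_x = np.vstack([1.0 / w, r / w**2])
    return -np.linalg.solve(H, S_x)

# Hypothetical data: 25 observations, known noise variances p_i.
rng = np.random.default_rng(0)
n = 25
p = rng.uniform(0.5, 1.5, size=n)
x_star = rng.normal(0.0, np.sqrt(4.0 + p))

theta_star = fit_mle(x_star, p)             # baseline MLE
T = sensitivity(theta_star, x_star, p)      # first-order sensitivities

dx = np.zeros(n)
dx[0] = 0.1                                 # small change in one observation
theta_ifap = theta_star + T @ dx            # first-order approximation
theta_ml = fit_mle(x_star + dx, p)          # recomputed MLE for comparison

err_ifap = np.linalg.norm(theta_ifap - theta_ml)
err_base = np.linalg.norm(theta_star - theta_ml)
```

For a small perturbation, `err_ifap` should be much smaller than `err_base`, since the approximation error is second order in the change while the shift in the estimate is first order. The approximation needs one linear solve, whereas the recomputed MLE needs a full iterative fit.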

3.  SIGNAL-PLUS-NOISE  EXAMPLES 


There are three subsections in this section.
Subsection I describes how the data were
generated. Subsection II demonstrates the
accuracy of the approximations. Subsection III
uses two examples to show the IFAP results.
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I.  Introduction 


II.  Accuracy  Demonstration 


A series of numerical results was generated
to evaluate the IFAP methodology and its software.
These results use the same set of inputs
$x_i^*$, $P_i$ for $i = 1, 2, \ldots, 25$, where

$$x_i^* \in R^{15}, \qquad P_i = A_i A_i^T \quad \text{for } i = 1, 2, \ldots, 25.$$

$A_i$ is a 15 x 30 matrix with its elements,
$\{a_{jk}\}$, generated randomly using the uniform
distribution over (-1, 1).

$x_i^*$ is generated using the normal distribution
$N(0, P_i + \Sigma)$.
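
The data-generation recipe described above can be sketched as follows. This is a minimal illustration, not the authors' program: the true signal covariance used in the study is not recoverable from the text, so the matrix `sigma` below is a hypothetical stand-in that mirrors the structure described in this section (a full 9 x 9 block plus six independent diagonal terms), and the seed is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1988)          # arbitrary seed
n_samples, dim, n_cols = 25, 15, 30

# Hypothetical "true" signal covariance (assumption, not from the paper):
# a full 9 x 9 block plus six independent diagonal terms.
B = rng.normal(size=(9, 9))
sigma = np.zeros((dim, dim))
sigma[:9, :9] = B @ B.T + 9.0 * np.eye(9)
sigma[9:, 9:] = np.diag(rng.uniform(100.0, 500.0, size=6))

x_list, P_list = [], []
for _ in range(n_samples):
    A_i = rng.uniform(-1.0, 1.0, size=(dim, n_cols))  # elements uniform on (-1, 1)
    P_i = A_i @ A_i.T                                  # known noise covariance, 15 x 15
    x_i = rng.multivariate_normal(np.zeros(dim), P_i + sigma)
    P_list.append(P_i)
    x_list.append(x_i)

x_star = np.array(x_list)   # shape (25, 15): one row per sample
P = np.array(P_list)        # shape (25, 15, 15)
```

Because each A_i has more columns than rows, each P_i is positive definite with probability one, which is consistent with the requirement in Section 2 that the P_i's exist.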


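A nonconstrained scoring algorithm is used to
get a maximum likelihood (ML) estimate,
$\hat\theta^* = \hat\theta(x^*)$, as the baseline value. The
parameters of $\theta$ include all the means
$(\mu_1, \mu_2, \ldots, \mu_{15})$ and the nonzero part of the
following covariance matrix:

$$\Sigma = \begin{bmatrix}
\Sigma_{1,1} & & & & & & \\
\Sigma_{2,1} & \Sigma_{2,2} & & & & 0 & \\
\vdots & \vdots & \ddots & & & & \\
\Sigma_{9,1} & \Sigma_{9,2} & \cdots & \Sigma_{9,9} & & & \\
 & & & & \Sigma_{10,10} & & \\
 & 0 & & & & \ddots & \\
 & & & & & & \Sigma_{15,15}
\end{bmatrix}$$

That is, $\theta$ contains the unique elements of the
upper 9 x 9 section and the diagonal elements of
rows 10 to 15. For a change, $\Delta x$, from $x^*$, the
IFAP program generates $\tilde\theta(x^* + \Delta x \mid x^*)$, the
approximation to $\hat\theta(x^* + \Delta x)$. Then, for each
$\Delta x$ there are ML estimates $\hat\theta$ corresponding to
$(\hat\mu, \hat\Sigma)$. By comparing $\tilde\theta$ and $\hat\theta$, the accuracy of
the IFAP approximation can be studied. Since
IFAP is a first-order estimator, $\tilde\theta(x^* + \Delta x \mid x^*)$
and the ML estimate $\hat\theta(x^* + \Delta x)$ are expected to
be different. However, $\tilde\theta$ will approach $\hat\theta$ as $\Delta x$
approaches zero.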


There are six cases in this section: in the
first case, $\Delta x$ has only one nonzero element in
x; in the next three cases, $\Delta x$ represents a
certain percentage change of all elements in one
of the x's with respect to the baseline $x^*$; for
the fifth case, $\Delta x$ represents a change of one
positive standard deviation for all elements of
one x; the last case assumes one of the samples
is abnormal and that $\hat\theta^* = \hat\theta(x^* + \Delta x)$, and then
the IFAP approximation, $\tilde\theta(x^* \mid x^* + \Delta x)$, is
compared to the MLE, $\hat\theta(x^*)$.

CASE 1: The first element of $x_1$ was changed by
100% of its nominal value ($x_{1,1}^*$), i.e.,

$$\Delta x = \mathrm{vec}[\Delta x_1, 0, \ldots, 0]$$

where vec denotes the vector form of a matrix,
the zeroes to the right indicate that all the
elements in that column are zeroes, and

$$\Delta x_1 = [x_{1,1}^*, 0, \ldots, 0]^T.$$

Since only the first element of the first x
was changed, the parameters $\mu_1$ and $\Sigma_{1,1}$ were
most affected. The $(\hat\mu_1^*, \hat\Sigma_{1,1}^*)$ values are (2.21,
223); the IFAP approximations are (2.57, 230); the
MLEs are (2.59, 228). The normalized (defined
below) differences between the IFAP
approximation and nominal values are (.11, .09);
the differences between the MLE and nominal
values are (.12, .08). The normalization factors
were the square roots of the appropriate
diagonal elements of the Fisher covariance matrix
evaluated at the true $\mu$ and $\Sigma$ values used in
simulation. The changes in the other parameters
are small, not exceeding .03 times their standard
deviations.

CASE 2: The elements of $x_{12}$ were changed by 50%
of their nominal ($x_{12}^*$) values, i.e.,

$$\Delta x = \mathrm{vec}[\ldots, 0, 0.5\,x_{12}^*, 0, \ldots]$$


Table 1

Comparison of the IFAP and ML Solutions
50% Changes for All Elements of x_12

STATE        BASE        MEAN ESTIMATES      NORMALIZED DELTA        BASE  COVARIANCE ESTIMATES      NORMALIZED DELTA
             MEAN         ML       IFAP         ML       IFAP  COVARIANCE         ML       IFAP         ML       IFAP
1            2.21       1.84       1.87      -0.12      -0.11        223.       243.       241.       0.26       0.23
2           -7.12      -7.40      -7.32      -0.09      -0.06        178.       183.       184.       0.08       0.08
3            2.93       2.81       2.77      -0.04      -0.05        221.       225.       224.       0.06       0.05
4           -4.72      -4.94      -4.75      -0.08      -0.01        111.       112.       112.       0.02       0.02
5           -1.82      -2.52      -2.22      -0.22      -0.13        211.       241.       233.       0.39       0.26
6            3.18       3.25       3.42       0.02       0.06        389.       390.       390.       0.02       0.01
7           -3.19      -3.76      -3.84      -0.18      -0.20        290.       326.       315.       0.46       0.36
8            0.89       0.71       0.55      -0.06      -0.11        300.       304.       302.       0.0?       0.04
9           -4.82      -5.13      -4.89      -0.10      -0.02        203.       201.       199.      -0.03      -0.05
10           1.01       0.87       0.7?      -0.04      -0.09        267.       274.       274.       0.08       0.09
11          -3.44      -2.70      -2.7?       0.23       0.21        179.       244.       232.       0.84       0.69
12           2.89       2.58       2.76      -0.10      -0.04        328.          ?          ?       0.11       0.07
13          -3.84      -3.92      -3.85      -0.02      -0.00        165.       165.       1?5.       0.00       0.0?
14          -2.71      -1.98      -2.03       0.23       0.21        460.       532.       515.       0.90       0.70
15          -3.43      -4.4?      -4.21      -0.26      -0.18        107.       166.       152.       0.76       0.57
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where  0  represents  all  elements  In  the  column 
are zeros. The IFAP and ML estimates are given
in Table 1. The BASE MEAN and BASE SIGMA in the
table represent the baseline parameter estimates,
$\hat\mu^*$ and $\hat\Sigma^*$. The updated ML and IFAP
estimates for the means and variances represent
the values for $\hat\theta_j(x^* + \Delta x)$ and $\tilde\theta_j(x^* + \Delta x \mid x^*)$.
The DELTAS, $\hat\theta_j - \hat\theta_j^*$ and $\tilde\theta_j - \hat\theta_j^*$, are normalized
by the appropriate standard deviation as
described in CASE 1.

As shown in the table, the actual ML values
and the IFAP estimates are fairly close. Also
note that the signs of the deltas are the same in
ML and IFAP for every parameter.

However, it is not immediately apparent from the
table how well IFAP works as a predictor of the
relative sensitivities. That is, can IFAP accurately
detect the parameter that is the most
sensitive, the second most sensitive, etc.?
Therefore, Figure 1 is a plot that compares the
ranks of the estimates in terms of their
sensitivities. Sometimes, the IFAP rank may not
match the appropriate ML rank, even though the
actual numerical differences between these two
estimates are small. So, an error bar chart was
added in Figure 1 underneath the rank plot.

The ranks of the parameters are assigned by
their values of the normalized deltas from the
largest negative value to the largest positive
(i.e., rank "1" is assigned to the largest
negative delta, rank "2" is assigned to the
second largest negative, etc., until the largest
positive delta is reached). Then, the IFAP vs.
ML ranks were plotted. If the IFAP and ML ranks
were perfectly matched, then the plotted points
would stay on a 45° line; on the other hand, if
the IFAP and ML ranks were completely unrelated,
the plotted points would be scattered evenly
throughout the area plotted.
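
The ranking rule just described can be sketched in a few lines. The normalized deltas below are hypothetical values chosen for illustration (they are not taken from Table 1), and the helper name `ranks` is an assumption. Rank 1 goes to the most negative delta and the highest rank to the most positive one; points where the two rank vectors agree would lie on the 45° line of the rank plot.

```python
import numpy as np

def ranks(deltas):
    """Rank 1 = most negative normalized delta, rank n = most positive."""
    deltas = np.asarray(deltas, dtype=float)
    order = np.argsort(deltas, kind="stable")
    r = np.empty(len(deltas), dtype=int)
    r[order] = np.arange(1, len(deltas) + 1)
    return r

# Hypothetical normalized deltas for five parameters.
delta_ml = np.array([-0.12, 0.26, -0.04, 0.03, 0.01])
delta_ifap = np.array([-0.11, 0.23, -0.05, 0.01, 0.02])

rank_ml = ranks(delta_ml)
rank_ifap = ranks(delta_ifap)
on_diagonal = rank_ml == rank_ifap          # points on the 45-degree line
abs_error = np.abs(delta_ifap - delta_ml)   # heights of the error bars
```

In this made-up example the two rank vectors disagree only for the two parameters with the smallest deltas, where the absolute errors are also small, which is the pattern the rank plot and error bar chart are meant to expose.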

The error bars show the absolute differences
between the IFAP and ML estimates,
$|\tilde\theta_j(x^* + \Delta x \mid x^*) - \hat\theta_j(x^* + \Delta x)|$, normalized by the
appropriate standard deviation. If the numerical
differences between the IFAP and ML estimates
are small, then the rank agreements are
less important.

Note that, in terms of ranks, the greatest
discrepancy between IFAP and ML occurs in the
middle of the plot (see Figure 1). These
parameters, however, also correspond to those that
are least sensitive to $\Delta x$ (and thus of least
interest), and, as shown in the error bar chart,
those for which the normalized errors between
IFAP and ML are smallest. This greater
discrepancy in ranks can be attributed to the
inherent variability associated with such small
normalized errors.

CASE 3: All the elements of $x_1$ were changed by
50% of their nominal ($x_1^*$) values, i.e.,

$$\Delta x = \mathrm{vec}[0.5\,x_1^*, 0, \ldots, 0]$$

where 0 indicates that all elements in that
column are zeros. This case is similar to
CASE 2. The purpose of this case is to
demonstrate that CASE 2 was fairly typical.


[Figure 1: ML/IFAP Ranks and Normalized Errors, 50% Changes for All Elements of x_12. Rank plot above an error bar chart; horizontal axis: Rank Predicted by IFAP.]

[Figure 2: ML/IFAP Ranks and Normalized Errors, 50% Changes for All Elements of x_1. Rank plot above an error bar chart; horizontal axis: Rank Predicted by IFAP.]
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Figure  2  shows,  i  he  rank  plot  and  the  error  bar 
chart of this case. The pattern in this figure
is the same as in Figure 1. The lower left and
upper right of the plot have many points lying
on the 45° line, while the error bars in the center
area are short and have errors of the same
magnitude as in CASE 2.

Since the plot and chart together convey the
essential information for comparing IFAP and ML,
a table such as Table 1 will be omitted in this
case as well as in CASES 4 and 5.


CASE 4: All the elements of $x_{12}$ were changed by
100% of their nominal values, i.e.,

$$\Delta x = \mathrm{vec}[\ldots, 0, x_{12}^*, 0, \ldots]$$

In comparison with CASE 2, $\Delta x$ is twice as
large; this case also shows larger differences
on the error bar chart and a more scattered
rank plot. Figure 3 shows that the largest
error in the chart tripled in value, and there
are 6 more points off the 45° line in the
plot. However, it is apparent from the plot
that there is still a strong tendency for the
points in the rank plot to lie near the 45°
line. The IFAP approximation has the same ranks
as the MLE at the lower left and upper right
corners of the plot. The points off the 45°
line are concentrated in the center section and,
as before, correspond to smaller normalized
errors.


CASE 5: All the elements of $x_1$ were changed by
positive one standard deviation, i.e.,

$$\Delta x = \mathrm{vec}[\Delta x_1, 0, \ldots, 0]$$


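where

$$\Delta x_1 = \left[(\Sigma_{1,1} + P_{1,1})^{1/2},\ (\Sigma_{2,2} + P_{2,2})^{1/2},\ \ldots,\ (\Sigma_{p,p} + P_{p,p})^{1/2}\right]^T.$$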

Some of the changes in the previous cases may
be small in comparison to the standard deviations,
since 100% of a small value is still
small. The changes of the elements in this case
have the same ratio to the standard deviations;
therefore, the rank plot at the top of Figure 4
is expected to be more evenly spread out than
the previous plots. Even with this spread, the
points in the plot still tend to stay along the
45° line, and several of the most sensitive
parameters have been assigned the same ranks
in both IFAP and ML. Referring to the error bar
chart at the bottom of Figure 4, the points that
tend to be off the 45° line in the plot have
smaller errors, less than one-tenth of a standard
deviation. Therefore, the IFAP approximation
and MLE are closely matched.


CASE 6: Assume that all the x's have the same
values $x_i^*$ as in the previous cases (1-5), except
that $x_3^*$ is replaced by $2x_3^*$. Then all elements of
$2x_3^*$ are changed back to their original values, i.e.,

$$\Delta x = \mathrm{vec}[0, 0, -x_3^*, 0, \ldots]$$

This case was designed to show how IFAP would do
when one of the x's was an outlier. The comparison
of the IFAP approximation and MLE is shown
in Table 2.

[Figure 3: ML/IFAP Ranks and Normalized Errors, 100% Changes for All Elements of x_12]

[Figure 4: ML/IFAP Ranks and Normalized Errors, Positive One Standard Deviation Changes for All Elements of x_1]

The two examples shown in this section are
displayed in Tables 3 and 4. Each example was
generated from nine similar IFAP runs. Every
run was generated from the same baseline


Table 2

Comparison of the IFAP and ML Solutions
100% Changes for All Elements of x_3

STATE        BASE        MEAN ESTIMATES      NORMALIZED DELTA        BASE  COVARIANCE ESTIMATES      NORMALIZED DELTA
             MEAN         ML       IFAP         ML       IFAP  COVARIANCE         ML       IFAP         ML       IFAP
1            1.43       2.21       1.71       0.24       0.09        246.       223.       216.      -0.28      -0.36
2           -7.47      -7.12      -7.12       0.11       0.11        161.       176.       172.      -0.05      -0.12
3            4.43       2.93       3.37      -0.48      -0.34        338.       221.       176.      -1.51      -2.10
4           -5.20      -4.72      -4.64       0.15       0.11        116.       111.       103.      -0.07      -0.17
5           -2.26      -1.82      -1.91       0.14       0.11        224.       211.       201.      -0.17      -0.30
6            4.47       3.18       3.08      -0.41      -0.44        500.       389.       356.      -1.44      -1.66
7           -3.79      -3.19      -3.37       0.19       0.13        307.       290.       279.      -0.22      -0.35
8            1.92       0.89       0.95      -0.33      -0.31        360.       300.       264.      -0.77      -0.98
9           -4.11      -4.82      -4.89      -0.22      -0.25        246.       203.       194.      -0.55      -0.66
10           2.70       1.01       1.03      -0.54      -0.53        470.       267.       202.      -2.61      -3.46
11          -3.54      -3.44      -3.33       0.03       0.07        178.       179.       179.       0.00       0.01
12           2.58       2.69       3.13       0.10       0.17        331.       326.       328.      -0.04      -0.03
13          -4.23      -3.84      -4.08       0.15       0.0?        169.       165.       156.      -0.05      -0.16
14          -3.48      -2.71      -2.78       0.24       0.22        487.       460.       451.      -0.34      -0.45
15          -4.14      -3.63      -3.76       0.16       0.12        120.       107.       101.      -0.16      -0.25

The baseline value in Table 2 is
$\hat\theta^* = \hat\theta(x^* - \Delta x)$. The MLEs have the same values


Table 3

Relative Changes in $\theta$ Parameters to Their Standard Deviations
for 100 Percent Changes in x (First 9 of 25 Samples)

SAMPLE (right): x_1 through x_9
PARAMETER (down)


0.04  -O.Ob  -0.07  0.14  0.11  -0.01  O.SO  0.26  0.43 

-0.36  -0.56  -0.12  0.28  -0.03  0.29  -0.09  '0.12  0.07 

0.53  0.22  0.31  -0.15  0.13  -0.04  0.23  0.07  -0.47 

-0.25  0.22  '0.09  O.lS  -0.42  0.20  -0.24  -0.07  0.26 

-0.03  -0.02  -O.ll  0.13  -0.21  0.09  0.40  0.16  -0.54 


0.33  -0.73  0.43  0.01 

0.07  -0.55  -0.11  -0.39 
-0.32  0.12  0.32  -0.19 

0.09  -0.03  0.24  0.21 

0.15  -0.42  0.53  0.20 

-0.27  -0.34  -0.08  -0.07 
O.SO  0.24  -0.19  -0.01 
O.06  *."•  C.w4  -0.J7 

-Oil  -0.21  -0.24  -0.03 
-0-23  -0.25  -O.II  0.35 


0.05  0.04  0.23  0.24 

0.98  l.l'*  n.05  O.b** 

1.42  0.36  0.90  0.25 

0.17  0.07  -0.05  0.17 

-0.12  0.08  0.04  -0  o: 


0.26  0.00  -0.00  -0.24  0.42 

-0.17  -0.06  -0.25  0.13  0.25 

-0.07  -0.04  -0.31  0.55  -0.16 

-0.35  -0.32  -0.17  0.15  0.22 

-0.45  -0.02  -0.22  -0.04  0.25 

0.12-0.20  0.08  0.32  0.09 

0.59  -0.42  -0,01  -0.07  0.26 

■7  36  -0,20  -0.42  -0.24  '0.00 
-0.19  -0.33  0.40  -0.26  -0.64 
-0.15  -0.15  -0.17  0.11  0.04 

3.60  2.3?  3.49  2.70  4,12 

0.23  0.02  1.17  0.54  0.49 

0.04  0.8)  0.03  0.02  0.21 

-O.n  0.00  0.14  O.OJ  0.17 
1  .52  0.21  0.43  -O.n?  0.15 

0,10  n.06  1,1?  n.?4  1.75 


as the previous baseline value $\hat\theta(x^*)$, and the
IFAP approximation is $\tilde\theta(x^* \mid x^* - \Delta x)$. The
normalized deltas are the differences between the
IFAP or ML estimates and $\hat\theta^*$; they are then
divided by their standard deviations as
described in CASE 1.


4  1.21  ?  .41  I  .0?  0.26  -0.0)  -0.06  0,01  0.37  0.59 

ly.y  0.27  0.66  0.17  0.78  0.10  3.01  0.12  0.14  0.34 

I,.,  0.36  -0.08  0.57  0,16  0.14  0.09  0.67  l.5l  0.24 

1.. ,  0-00  -0.01  0.46  0.)h  0.43  0.5*  0,18  n.30  0,10 

t,,.,,  0.12  0.65  1.80  0.5:<  n.70  -O.Ol  0.25  0.05  0.39 

1.. ..,  0.18  0.44  0.06  0.15  -c.OC  0.17  -O.IO  0.39  0,09 

I,;.,,  1-94  0.20  0.08  0.25  1,46  1.1/  -O.Ol  0.0b  0.46 

t,,.,,  0.07  1.56  0.05  0.02  G.p  0.04  0.95  0.20  0,03 

C,4  I,  0.14  O.JO  'J.2J  O.W  -tl.Ob  0.52  0.57  0.56  2.26 

ti,.H  0.04  0.18  0.03  O.'a  U.12  O.OJ  0.16  -0.05  U.05 

M(|,  7.07  8.05  5.//  4.39  5.23  3.7b  5.90  4.4  '  1.93 


Table 2 shows that the IFAP and ML estimates
are near one another even though $\hat\theta(x^* - \Delta x)$ is
far from $\hat\theta(x^*)$, and that the (normalized) deltas
for the IFAP and ML have the same signs. Thus,
IFAP performs well for this outlier-type case,
too.

III.  The IFAP Results/Interpretation

There are several ways to apply IFAP in
actual data processing, e.g., approximating the
estimates and studying the sensitivities. This
section presents two tables that give some
insight into such applications of IFAP. Recall
that the IFAP program uses the same data as the
ML estimator.

Most sensitivity studies require a large
number of runs. Therefore, a study was made to
investigate the efficiency of the IFAP program.
The study, including 51 samples, 59
states and 154 parameters, shows that the IFAP
CPU time was less than 1/25 of that required to
generate an MLE by DL/scoring. For a case of
this size, an IFAP run took about 10 CPU
seconds on an IBM 3083. In other cases, of course,
the CPU times may vary according to the number
of x's, states, and parameters.


Table 4

Relative Changes in $\theta$ Parameters to Their Standard Deviations
for 100 Percent Changes in x (First 9 of 25 Samples)

SAMPLE (right): x_1 through x_9
PARAMETER (down)

W,  -0.03  0.08  0.09  -0.V3  -0.06  0  ''2  ''.49  -0.24  -0.42 

Vi  0.34  0.56  0.11  -0.29  0.03  -0.29  0.08  0.12  -0.07 

ii,  -0.55  -0.21  -0.34  0.14  -0.1)  0.03  -0.24  -0.07  0  45 

V,  0.22  -0,22  O.II  -0.16  0.41  -0.21  0.24  O.OH  -0.26 

y,  C.OO  O.Ol  0.11  -0.13  0.17  -008  -0.40  -0.16  0.55 

w*  -0.36  0.74  -0.4;  -0.02  -0.25  -0  01  U.Ol  0.23  -0.42 

V,  -0,07  0.55  0.13  0.38  0.|7  0.07  D.25  -0.13  -0.25 

w,  0.27  -O.W  -0.31  0.19  0  Oft  0.04  0 . 3 1  -  U .  54  O.lb 

W,  -0.09  0.01  -0.25  -0.21  0.32  0.32  0.17  -0.15  -0.22 

y,,  -0.16  0.43  -0.53  -0.20  0.48  O.UO  0.22  0.04  -0.24 

y,,  0.26  0.34  0.07  0.07  -0.14  0.21  -0.07  -0.31  -0.10 

-O.SO  -0.24  0  I?  -0,00  -O.^ft  It  4n  O.Ol  0.08  -0.2b 

y,,  -0,09  0.56  0.08  0.17  0.34  0.19  0.42  0.25  00 

U,,  O.ll  0  24  0.22  0.04  0.20  0.32  -0.40  0.2b  0.64 

y,^  0.20  0,24  0.12  -0.3/  0,15  0.15  O.lb  -0.10  -0.05 

lull,  3.28  4-57  3.06  2  51  3 . ‘'O  It  t  4b  2.75  4  09 

I,  ,  0,07  0  02  0.30  -0.25  0.21  0.02  1.18  -0.54  0.12 

If.,  -0.93  -1.17  -0,12  -0.68  -0,01  -0.84  -0.02  -0.02  -0.22 

I,.,  -I  31  -0,15  -2.10  -0.25  0  11  C’.OS  -0.13  -0.00  -0.79 

t,  ,  -0.15  -0.06  -0.17  -0.39  -1,42  -0  24  -0  41  0.02  -0.17 

0  15  -0,10  -0.30  0.01  -O.'.l  -C.i-'5  -1-14  -0.24  -1.72 

tg  ,  -1.16  -2.53  -1  86  -0  27  0.01  0  04  -0.05  -0,41  -0.54 

I,  ,  -0.24  -0.65  -0.35  -0.74  -0.08  -0.00  -0.12  -O.lb  -0.3b 

I,  ,  -C*  38  0  06  -0.98  -0.20  0.19  -n  ;4  -0  67  -1  46  -0.2b 

1.. ,  O.Ol  -O.Ol  -0  66  -0  3S  0.45  0.53  -O.lfl  -0.28  -0.08 

J,,  ,,  0.08  0  '2  -3.46  -0.50  0.79  0  Of  0  11  0.00  -O.tO 

1.,  ,,  -O  19  -0.43  O.Ol  -O.lb  -0  02  0  17  (j  U  -O  )'4  -O.'.O 

1.,  ,,  -I  94  -0  20  -0.03  -0.29  -1  51  -I  13  0  02  0.06  -0  48 

I,i  ,,  0  09  -I  53  -0.16  -0  03  -0  16  -0  03  -0  93  -0.20  -0  04 

1.. ..,  0  07  -0.34  -0.45  0  n  0  05  n.'.o  0  58  -0.53  2  20 

1.,  ,,  0  04  -0.16  -n  25  -n  84  -0  ll  -0  05  -0  15  0  05  0  Ob 

P,  6  82  8.12  U  2B  5  OR  ‘,.4  ‘8b  '09  4  44  7  ro 
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8  ,  X  ,  and  P^'s.  Each  column  corresponds  to 
the one sample that was changed by 100% in
generating $\tilde\theta$; the other samples remain at their
$x^*$ values. Table 3 was generated using the
$x^*$'s and P's as in Section 2. Table 4 was
generated with a modified $x_3^*$, the elements of
which were doubled in comparison with the values
of Table 3. This will illustrate how IFAP
performs in the presence of an outlier (sample
3).


In Section 3.II it was shown that the current
first-order implementation provides an accurate
approximation to the changes in parameter
estimates resulting from changes in the data of a
selected x. It was found that if the parameters
were ranked in order of their sensitivities to
these changes in data, the ranks of the parameters
as given by IFAP were close to the ranks
as given by recalculating MLEs. This was especially
true among those parameters that were
most sensitive to data changes, which, of
course, would correspond to the parameters of
most interest.


Tables 3 and 4 show the normalized deltas,
i.e., the difference between $\tilde\theta$ and $\hat\theta$, normalized
by the Fisher-based standard deviations as
described in CASE 1. The columns headed by $x_1$,
$x_2$, ..., $x_9$ correspond to samples 1, 2, ...,
9. The values for $|\mu|$ and $|\Sigma|$ at the bottom of
each column denote the sum of the absolute
values of the entries in that column.

Table 3 was generated using $\hat\theta^* = \hat\theta(x^*)$ and
$x_i^* \sim N(0, P_i + \Sigma)$ for $i = 1, 2, \ldots, 25$. Notice
that the $|\Sigma|$'s are about twice as large as the
$|\mu|$'s within all samples. The differences
in $|\Sigma|$ and $|\mu|$ may be largely attributed to the
relationship between $\Delta x$ and $\Delta\mu$ and between
$\Delta x$ and $\Delta\Sigma$. Recall that the equation
$\partial L/\partial\mu = 0$ implies that
$\Delta x$ and $\Delta\mu$ have a linear relationship, while the
equation $\partial L/\partial\Sigma = 0$ implies that $\Delta x$ and $\Delta\Sigma$ have
approximately a second-order relationship.

Table 4 was generated using
$\hat\theta^* = \hat\theta(x^* + \Delta x_3)$, $x_i^* \sim N(0, P_i + \Sigma)$ for $i = 1$,
2, ..., 25 and $\Delta x_3 = \mathrm{vec}[0, 0, x_3^*, 0, \ldots]$. The
purpose of this example is to show the outcome
of IFAP when there is an outlier, sample 3, in
the system. In real data analysis, outliers may
dominate the result and lead to erroneous
conclusions. Although IFAP is not designed
specifically for isolating outliers, the unusually
large value of $|\Sigma|$ and the ratio of
$|\Sigma|$ to $|\mu|$ for sample 3 reflect the fact that
this sample is an outlier.
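
The column summaries described above are simple to compute once the normalized deltas are available. The sketch below uses small hypothetical matrices (rows are parameters, columns are samples); the numbers are invented for illustration and are not the entries of Tables 3 or 4, and the function name `column_summaries` is an assumption.

```python
import numpy as np

def column_summaries(delta_mu, delta_sigma):
    """Column sums of absolute normalized deltas and their ratio."""
    abs_mu = np.abs(delta_mu).sum(axis=0)        # |mu| row of the table
    abs_sigma = np.abs(delta_sigma).sum(axis=0)  # |Sigma| row of the table
    return abs_mu, abs_sigma, abs_sigma / abs_mu

# Hypothetical normalized deltas for three parameters and three samples.
delta_mu = np.array([[0.1, -0.2, 0.3],
                     [-0.3, 0.1, -0.4],
                     [0.2, 0.3, 0.1]])
delta_sigma = np.array([[0.4, -0.3, -2.1],
                        [-0.5, 0.6, -1.5],
                        [0.3, -0.2, -3.4]])

abs_mu, abs_sigma, ratio = column_summaries(delta_mu, delta_sigma)
suspect = int(np.argmax(abs_sigma))   # column with the largest |Sigma|
```

Here the third column stands out both in its |Sigma| total and in the ratio of |Sigma| to |mu|, which is the kind of signal the text associates with sample 3. As the paper notes, this is an empirical observation rather than a formal outlier test.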

4.  CONCLUSION 

This  study  demonstrates  that  IFAP  can  be  an 
effective  and  efficient  tool  for  studying  the 
impact  of  anomalous  data  on  the  ML  estimates  of 
means  and  variances. 


Section 3.III demonstrates how IFAP might
apply in actual data analysis. In particular,
two tables were presented that illustrate how
IFAP can be used to show at a glance how sensitive
various parameter estimates are to changes
in the data of one x. As an aside, it was found
that we were able to detect an outlier x by its
abnormal impact on the estimate of the variance
terms; we have not yet developed a rigorous
theoretical basis for this observed phenomenon.
We found that it was approximately 25
times more efficient (in terms of CPU time) to
calculate updated IFAP estimates than to calculate
updated MLEs in a larger size problem.

This may be the difference between feasibility
and infeasibility in a large-scale data
sensitivity study.
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1.  INTRODUCTION 

Clinical trial designs for comparing an
experimental treatment with an appropriate
control commonly use several investigators
located at a variety of medical centers,
all operating from the same protocol.

This paper is concerned with treatment
versus control comparisons based on a
dichotomous response. A specified event,
termed a response in this paper, is
observed to have occurred or not occurred
for each subject in the trial. The
context for a statistical comparison of
treatment and control is a stochastic
model assuming treatment and control
probabilities of response at each center.

DerSimonian and Laird (1986) observe
that the control and treatment response
probabilities will likely vary from center
to center, or vary from study to study in
a meta-analysis of similar clinical
trials. They propose a random effects
model assuming that a center's control and
treatment response probabilities are
themselves random variables with a
distribution dependent on the population
of centers under study.

Let <P,Q> be the control and treatment
response probabilities at a given center.
The pair <P,Q> are themselves random
variables with:

(1.1)  joint distribution of <P,Q> =
g(p,q), <p,q> varying in the
unit square, or in a subset of
the unit square.

Following the selection of <P,Q>,
independent samples of n control and m
treatment subjects are observed. If X, Y
are the observed frequencies of the
control and treatment responses, then X
and Y are assumed to have independent
binomial p.d.f.'s conditioned on the
assumed values P=p, Q=q.

If k centers are planned for a
multi-center trial, then the unobserved
response probabilities <Pj,Qj>, j=1,...,k are
assumed i.i.d. from g, while Xj, Yj are the
observed control and treatment response
frequencies from nj, mj subjects at
center j.
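
The two-stage sampling scheme just described can be simulated directly. The sketch below is illustrative only: the paper leaves the random effects density g general, so the choice of independent Beta distributions, the sample sizes, and the function names (`simulate_trial`, `independent_betas`) are all assumptions made for the example.

```python
import numpy as np

def simulate_trial(k, n, m, draw_pq, rng):
    """Draw <P_j, Q_j> i.i.d. from g, then X_j ~ Bin(n_j, P_j), Y_j ~ Bin(m_j, Q_j)."""
    pq = np.array([draw_pq(rng) for _ in range(k)])   # shape (k, 2)
    x = rng.binomial(n, pq[:, 0])                     # control responses
    y = rng.binomial(m, pq[:, 1])                     # treatment responses
    return pq, x, y

def independent_betas(rng):
    # Hypothetical choice of g: independent Beta laws on the unit square.
    return rng.beta(2.0, 5.0), rng.beta(3.0, 5.0)

rng = np.random.default_rng(0)
n = np.array([20, 35, 50, 40])    # control subjects per center (made up)
m = np.array([22, 30, 50, 45])    # treatment subjects per center (made up)
pq, x, y = simulate_trial(len(n), n, m, independent_betas, rng)
```

Only x and y would be observed in practice; the pairs in `pq` are the unobserved center-level probabilities whose distribution g the inference targets.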

This  paper  explores  the  use  of  the 
bootstrap  method  (Efron,  1982)  to  compute 
significance  levels  and  confidence 


intervals  for  statistical  inference 
problems  generated  by  the  above  model . 
Parameters  are  defined  in  terms  of  the 
random  effects  density  g,  and  estimates  of 
these  parameters  are  generated  from 
estimates  of  g  based  on  the  observed 
response  frequencies.  An  important 
special  case  is  studied  first;  the 
proportional  odds  assumption,  where  the 
treatment  to  control  odds  ratios  are 
assumed  to  be  homogeneous  across  centers. 
Nonparametric  versions  of  the  bootstrap 
are  then  explored  for  the  more  general 
random  effects  model  when  the  proportional 
odds  assumption  cannot  be  used. 

Two  examples  will  be  given  illustrating 
the  use  of  these  methods.  One  example 
involves a multi-center trial, the other
example  is  a  meta-analysis  of  several 
trials  discussed  in  DerSimonian  and 
Laird's  paper.  For  this  meta-analysis  the 
sampling  unit  for  the  random  effects  model 
is  a  particular  study  rather  than  a  study 
site.  The  models  and  methods  used  here 
are  formally  the  same  for  a  meta-analysis 
as  they  are  for  the  multi-center  trial 
although  the  interpretation  of  the  results 
can  be  different. 

2. PROPORTIONAL ODDS MODELS

Suppose,  for  the  random  pair  <P,Q>,  the 
following  can  be  assumed: 

(2.1) Q/(1-Q) = r*P/(1-P), r a fixed
positive constant.

Here  the  odds  for  occurrence  in  the 
treatment  group  is  a  constant  multiple  of 
the  control  odds  for  occurrence,  and  r  is 
the  constant  odds  ratio,  treatment  to 
control. Under this assumption the random
pair <P,Q> must vary within a one
dimensional subset (curve) of the unit
square.

This  important  special  case  has  been 
studied  extensively  in  connection  with  the 
Mantel-Haenszel test. Note that the
hypothesis of r=1 in the proportional odds
model implies Pj=Qj for every center
j=1,2,...,k. This is the usual "no
treatment effect" null hypothesis for the
Mantel-Haenszel test. Wittes and
Wallenstein (1987) discuss approximations
to the power of this statistic and give an
excellent reading list on this subject.

Under the proportional odds assumption
inferences  about  the  fixed  odds  ratio  r 
(or log odds ratio lr = ln(r)) do not
require  a  random  effects  model  even  though 
the  response  probabilities  can  vary 
considerably  from  center  to  center.  The 
conditional  likelihood  function  given  the 
assumed values P1=p1, P2=p2, ..., Pk=pk, and

qj = r*pj/(1-pj+r*pj)

can be expressed entirely in terms of the
control response rates, pj, j=1,...,k, and
the common odds ratio r. The maximum
likelihood estimates of r and p1, p2, ..., pk
cannot be found in closed form, but an
elementary numerical iteration can be used
to  calculate  the  estimates  and  their 
standard  errors.  Bootstrapping  the 
sampling  distribution  of  the  MLE  of  r,  and 
the  Mantel-Haenszel  estimate  of  r  given  in 
Fleiss, 1981, suggests their sampling
distributions  are  very  similar  for 
examples  involving  moderate  within  site 
sample  sizes. 
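
The two estimates of the common log odds ratio mentioned above can be sketched as follows. This is a minimal illustration, not the author's SAS code: it uses the ulcer counts from the first example in section 5, computes the Mantel-Haenszel estimate directly, and finds the joint MLE by nested bisection (one possible "elementary numerical iteration"; the function names and the bisection scheme are assumptions made for this sketch).

```python
import math

# Ulcer trial counts from the first example in section 5:
# (placebo healed x, placebo n, drug healed y, drug m) for each site.
SITES = [(11, 33, 24, 37), (2, 25, 5, 26), (5, 28, 7, 23),
         (4, 16, 3, 14), (7, 17, 6, 21), (3, 23, 4, 24)]


def expit(a):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-a))


def mantel_haenszel_log_or(sites):
    """Log of the Mantel-Haenszel common odds ratio, treatment to control."""
    num = sum(y * (n - x) / (n + m) for x, n, y, m in sites)
    den = sum(x * (m - y) / (n + m) for x, n, y, m in sites)
    return math.log(num / den)


def bisect(f, lo, hi, iters=200):
    """Root of a decreasing function f on [lo, hi] by bisection."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if f(mid) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def site_logit(x, n, y, m, lr):
    """Control logit a_j solving x + y = n*expit(a) + m*expit(a + lr)."""
    return bisect(lambda a: x + y - n * expit(a) - m * expit(a + lr),
                  -30.0, 30.0)


def mle_log_or(sites):
    """Joint MLE of lr = ln(r), with the site rates p_j profiled out."""
    def score(lr):
        total = 0.0
        for x, n, y, m in sites:
            a = site_logit(x, n, y, m, lr)
            total += y - m * expit(a + lr)   # score contribution for lr
        return total
    return bisect(score, -10.0, 10.0)
```

Calling `mantel_haenszel_log_or(SITES)` and `mle_log_or(SITES)` gives two numbers that can be compared with the common log odds ratio reported in section 5.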

Computation  of  the  MLE’s  permits  the 
use  of  the  likelihood  ratio  test  for 
assessing  the  goodness  of  fit  of  the 
proportional  odds  assumption.  In  practice 
this  test  should  be  made  at  a  level  higher 
than  p=.05  so  that  the  greater  sensitivity 
available  under  homogeneous  odds  ratios  is 
not  so  easily  assumed.  Examples  of  these 
methods  are  given  in  section  5.  The  next 
section  discusses  inferences  when  the  odds 
ratios  cannot  be  assumed  to  be 
homogeneous.  Because  sample  logits  tend 
to  be  more  normally  distributed  than  odds 
ratios,  log  odds  ratios  will  be  used  from 
now  on. 

3.  A  NONPARAMETRIC  RANDOM 
EFFECTS  FORMULATION 

Consider  again  the  random  response 
probabilities <Pj,Qj>, j=1,...,k, as a random
sample from the joint p.d.f. g defined in
(1.1). The null hypothesis of no
treatment effect proposed here is given
by:

(3.1) g(p,q) = g(q,p).

Symmetry  of  the  joint  p.d.f.  about  p=q 
conveys  the  essential  meaning  of  no 
treatment effect. Note that any real
valued transformation having the property
T(p,q)=-T(q,p) yields a distribution
symmetric about zero for T(P,Q) under
(3.1) and this in fact characterizes
(3.1). In particular, T(p,q) = ln(q/(1-q))
- ln(p/(1-p)) = log odds ratio satisfies
this property. Estimates of g under (3.1)
are  proposed  in  section  4.  Estimates  of 
the  joint  p.d.f.  g  in  general  are 
formulated  now  in  terms  of  the  estimated 
control  and  treatment  response  rates: 

PHj = (Xj+.5)/(nj+1)

QHj = (Yj+.5)/(mj+1), j=1,...,k.

Three  nonparametric  estimates  of  g  are 
considered  in  this  paper.  Each  of  these 
estimated  joint  p.d.f.s  assigns  all  of  its 
mass  to  the  subset  of  observed  response 
rate  pairs: 

SPRT = { <PHj,QHj>: j=1,...,k }.

(3.2) GH1(p,q) = 1/k, <p,q> in
SPRT,

(3.3) GH2(p,q) = c*nj*mj/(nj+mj),
<p,q> = <PHj,QHj>, where c is
the constant making GH2 sum
to 1.0 over SPRT,

(3.4) GH3(p,q) = maximum likelihood
estimate of g among p.d.f.s
assigning all mass within
SPRT.

Estimates (3.2), (3.3) are
computationally simple, and are consistent
when k and the min{n1,...,nk,m1,...,mk}
diverge to infinity. The details of this
are not relevant here because while the
nj's and mj's are often large, k=number of
centers is usually small.

The  maximum  likelihood  estimate 
specified by (3.4) can be obtained from
the  EM  algorithm  (Dempster,  Laird,  Rubin, 
1977),  but  a  closed  form  solution  is 
available  also.  The  likelihood  function 
that  must  be  maximized  has  the  form: 


\[ L(\theta) = \prod_{j=1}^{k} \sum_{w=1}^{k} \theta_w \, b_{wj}, \]

where

\[ \theta_w = g(p_w, q_w), \]

\[ b_{wj} = p_w^{x_j} (1-p_w)^{n_j - x_j} \times q_w^{y_j} (1-q_w)^{m_j - y_j}. \]


The  details  of  this  will  be  given  in 
another  paper. 
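
As an illustration of the EM route to (3.4), the sketch below iterates the standard EM update for mixing weights on a fixed support. It is a hedged sketch rather than the author's method: the support is taken to be the smoothed rate pairs in SPRT, the constant binomial coefficients are dropped from b (they do not affect the maximizer), and the function name, starting weights and iteration count are assumptions.

```python
import math

# Ulcer trial counts from the first example in section 5:
# (placebo healed x, placebo n, drug healed y, drug m) for each site.
SITES = [(11, 33, 24, 37), (2, 25, 5, 26), (5, 28, 7, 23),
         (4, 16, 3, 14), (7, 17, 6, 21), (3, 23, 4, 24)]


def em_mixing_weights(sites, iters=500):
    """EM iterations for the weights theta_w on the support SPRT.

    Returns the final weights and the log likelihood after each step.
    """
    k = len(sites)
    # Support points <PHw, QHw>: the smoothed observed response rates.
    support = [((x + 0.5) / (n + 1.0), (y + 0.5) / (m + 1.0))
               for x, n, y, m in sites]
    # b[w][j]: binomial kernel of site j's counts at support point w.
    b = [[math.exp(x * math.log(p) + (n - x) * math.log(1.0 - p)
                   + y * math.log(q) + (m - y) * math.log(1.0 - q))
          for x, n, y, m in sites]
         for p, q in support]

    def loglik(theta):
        return sum(math.log(sum(theta[w] * b[w][j] for w in range(k)))
                   for j in range(k))

    theta = [1.0 / k] * k            # start from equal weights
    history = [loglik(theta)]
    for _ in range(iters):
        mix = [sum(theta[w] * b[w][j] for w in range(k)) for j in range(k)]
        # Each new weight is the average posterior probability of point w.
        theta = [theta[w] * sum(b[w][j] / mix[j] for j in range(k)) / k
                 for w in range(k)]
        history.append(loglik(theta))
    return theta, history
```

The update keeps the weights summing to one and never decreases the likelihood, so `history` gives a simple convergence check.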

Once an estimate of g is obtained,
namely GH using either (3.2), (3.3), or
(3.4), then estimates of the log odds
ratio can be formulated. In this section
the log odds ratio is not constant. Let

(3.5) lr(g) = E[ln(Q/(1-Q)) -
ln(P/(1-P)): g],

where the expected value is taken with
respect to g, and g is such that lr(g) is
finite. Define

(3.6) LRH = lr(GH)

as the estimate of the mean log odds ratio
based on GH. Note that LRH is just a
weighted average of the empirical log odds
ratios, the weights provided by GH. The
next section discusses how the sampling
distribution of LRH can be approximated
for forming confidence intervals and
computing significance levels.

4. APPLICATION OF THE BOOTSTRAP

The role of the bootstrap here is to
provide an approximation to the sampling
distribution of LRH, where both this
sampling distribution and LRH are
determined by GH. The bootstrap
distribution can be an imperfect
substitute for the unknown sampling
distribution of LRH determined by g, the
true underlying random effects p.d.f.
With small k there is no guarantee that
the measure defined by GH is anything like
the measure defined by g. There is also
the issue of whether a g exists, whether
sites chosen for a clinical trial are
representative of any real population,
with some p.d.f. g. Note that these
problems of small k and a population of
sites disappear under proportional odds
because the odds ratio MLE was driven
entirely by the conditional likelihood
given the assumed values of the site
response probabilities. However, when
proportional odds cannot be assumed, using
GH as a working model in a random effects
setting can be more credible than the
usual Mantel-Haenszel tests even with the
small k and the artifactual nature of GH.

In the realm of heterogeneous odds
ratios, the conclusions derived from a
data analysis may depend heavily on the
selection of the method for estimating g.
The bootstrap readily provides answers to
the inference problems within any
"computable" empirical model selected for
analysis, and therefore provides the
ability to assess how the general
conclusions regarding treatment effect
depend on the selection of this model, as
will be illustrated by examples in section
5. These examples will also illustrate
the price in precision that must be paid
in moving away from a proportional odds
assumption.

The general algorithm for the
bootstrap used here is as follows for the
random effect situation. The percentile-t
method of generating interval estimates
for lr(g) will be used in order to take
advantage of any reduction in coverage
probability error (Beran, 1987).

(4.1) Obtain an estimate, GH, as in
(3.2), (3.3), or (3.4).
Compute LRH and SEH, an
asymptotic approximation to
the standard error of LRH.

\[ SEH = \sqrt{ \sum_j GH_j^2 \, ( vh_j + \sigma^2(GH) ) } \]

where GHj = GH(PHj,QHj),

vhj = 1/(nj*PHj*(1-PHj)) +
1/(mj*QHj*(1-QHj)),

\[ \sigma^2(g) = \mathrm{Variance}( \ln(Q/(1-Q)) - \ln(P/(1-P)) : g ). \]
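
Step (4.1) can be illustrated with the ulcer counts from the first example in section 5. The sketch below is a hedged reading of the formulas, not the author's program: it takes SEH as the square root of the sum over sites of GHj^2 * (vhj + sigma^2(GH)), which is the form that appears to reproduce the standard errors tabulated in section 5, and it covers only the two simple weightings (3.2) and (3.3). Function names and the `kind` argument are assumptions.

```python
import math

# Ulcer trial counts from the first example in section 5:
# (placebo healed x, placebo n, drug healed y, drug m) for each site.
SITES = [(11, 33, 24, 37), (2, 25, 5, 26), (5, 28, 7, 23),
         (4, 16, 3, 14), (7, 17, 6, 21), (3, 23, 4, 24)]


def logit(p):
    return math.log(p / (1.0 - p))


def gh_weights(sites, kind):
    """Masses of GH on SPRT: 'equal' is (3.2), 'size' is (3.3)."""
    if kind == "equal":
        raw = [1.0 for _ in sites]
    elif kind == "size":
        raw = [n * m / (n + m) for _, n, _, m in sites]
    else:
        raise ValueError("kind must be 'equal' or 'size'")
    total = sum(raw)
    return [w / total for w in raw]


def lrh_and_seh(sites, kind):
    """Mean log odds ratio LRH under GH and its approximate standard error."""
    rates = [((x + 0.5) / (n + 1.0), (y + 0.5) / (m + 1.0))
             for x, n, y, m in sites]                 # <PHj, QHj>
    gh = gh_weights(sites, kind)
    lor = [logit(q) - logit(p) for p, q in rates]     # empirical log odds ratios
    lrh = sum(w * t for w, t in zip(gh, lor))
    var_g = sum(w * (t - lrh) ** 2 for w, t in zip(gh, lor))  # sigma^2(GH)
    vh = [1.0 / (n * p * (1.0 - p)) + 1.0 / (m * q * (1.0 - q))
          for (_, n, _, m), (p, q) in zip(sites, rates)]
    seh = math.sqrt(sum(w * w * (v + var_g) for w, v in zip(gh, vh)))
    return lrh, seh
```

For example, `lrh_and_seh(SITES, "equal")` returns the equal-weights estimate and its standard error, and `lrh_and_seh(SITES, "size")` the sample-size weighted version.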

(4.2) Sample i.i.d. k pairs
<PBj,QBj> from GH.

(4.3) For each j, sample nj, mj
binomial trials using response
probabilities <PBj,QBj> and
note XBj, YBj, the response
frequencies, j=1,2,...,k.

(4.4)  Using  the  data  obtained  from 
(4.3)  compute  the  estimate  GHB 
in  the  same  way  GH  was 
computed  from  the  original 
data.  Then  compute  LRHB  and 
its  approximate  standard  error 
SEHB  from  GHB.  Also  compute 
ZB=(LRH-LRHB)/SEHB,  the 
studentized  transformation. 
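
Steps (4.1) to (4.4), repeated NB times, can be sketched with NumPy as below. This is an illustrative sketch under stated assumptions, not the author's SAS program: it supports only the weightings (3.2) and (3.3), uses the ulcer counts from section 5, and forms the interval as LRH plus the 2.5th and 97.5th percentiles of the ZB's times SEH. The function names, the seed handling and the SEH form (sum of GHj^2 * (vhj + sigma^2(GH)) under the square root) are assumptions.

```python
import numpy as np

# Ulcer trial counts from section 5: columns are x, n, y, m per site.
SITES = np.array([(11, 33, 24, 37), (2, 25, 5, 26), (5, 28, 7, 23),
                  (4, 16, 3, 14), (7, 17, 6, 21), (3, 23, 4, 24)])
X, N, Y, M = SITES.T


def estimate(x, n, y, m, kind="equal"):
    """LRH, SEH, GH masses and smoothed rates from the observed counts."""
    ph = (x + 0.5) / (n + 1.0)
    qh = (y + 0.5) / (m + 1.0)
    raw = np.ones(len(n)) if kind == "equal" else n * m / (n + m)
    gh = raw / raw.sum()
    lor = np.log(qh / (1 - qh)) - np.log(ph / (1 - ph))
    lrh = np.sum(gh * lor)
    var_g = np.sum(gh * (lor - lrh) ** 2)
    vh = 1 / (n * ph * (1 - ph)) + 1 / (m * qh * (1 - qh))
    seh = np.sqrt(np.sum(gh ** 2 * (vh + var_g)))
    return lrh, seh, gh, ph, qh


def percentile_t_interval(x, n, y, m, kind="equal", nb=600, seed=0):
    """Percentile-t interval for lr(g), sampling from GH."""
    rng = np.random.default_rng(seed)
    lrh, seh, gh, ph, qh = estimate(x, n, y, m, kind)      # step (4.1)
    k = len(n)
    zb = np.empty(nb)
    for b in range(nb):
        idx = rng.choice(k, size=k, p=gh)                  # step (4.2)
        xb = rng.binomial(n, ph[idx])                      # step (4.3)
        yb = rng.binomial(m, qh[idx])
        lrhb, sehb = estimate(xb, n, yb, m, kind)[:2]      # step (4.4)
        zb[b] = (lrh - lrhb) / sehb                        # studentized value
    z_lo, z_hi = np.percentile(zb, [2.5, 97.5])
    return lrh, seh, (lrh + z_lo * seh, lrh + z_hi * seh)
```

Because the replicates are random, the interval end points vary with `seed` and `nb`; they should be read as approximations to the tabulated intervals, not exact reproductions.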

Repeat  (4.2,  3,  4)  NB  times  (I  used 
NB=600)  to  obtain  empirically  the  sampling 
distribution  of  LRH  when  using  GH  as  the 
random  effects  p.d.f.  The  empirical 
distribution  of  the  ZB’s  is  used  to  form  a 
percentile-t interval estimate of lr(g).

If  Z(2.5)  and  Z(97.5)  are  the  2.5th  and 
97.5th  percentiles  of  the  ZB’s  then 

(4.5) (LRH + Z(2.5)*SEH, LRH +
Z(97.5)*SEH)

is an approximate 95% confidence interval
for lr(g).

To test the null hypothesis given in
(3.1), namely g(p,q)=g(q,p), the
following estimation procedure is used
for g assuming the null hypothesis is
true. Let GH again be an estimator of g
with support SPRT.

(4.6)  First  translate  the  points  in 
SPRT  so  that  the  center  of 
mass  falls  on  the  line  p=q. 

PHTj = PHj + (mean[QH:GH] -
mean[PH:GH])/2

QHTj = QHj + (mean[PH:GH] -
mean[QH:GH])/2

(4.7) Now define the estimated null
distribution GHO, GHO(p,q) =
(1/2)*GH(PHj,QHj), for
<p,q>=<PHTj,QHTj> or
<QHTj,PHTj>, j=1,2,...,k.
GHO(p,q)=0 elsewhere.

Note  that  GHO  is  just  the  original  GH 
equally  divided  among  the  translated 
points  and  their  reflections  through  p=q, 
and  by  construction  satisfies  the  null 
hypothesis.  A  reflection  of  the  points  in 
SPRT  without  a  translation  would  create 
too  wide  a  dispersion  for  GHO  if  most  of 
the  points  in  SPRT  were  far  from  the  line 
p=q.  This  would  result  in  an 
unnecessarily  heavy  tailed  null 
distribution  for  LRH. 

To  obtain  an  empirical  one  tailed 
significance  level  for  the  estimate  LRH 
repeat  steps  (4.2),  (4.3),  (4.4)  NB  times 
sampling  from  GHO  instead  of  GH.  Here  the 
empirical  distribution  of  the  LRHB’s  will 
be  symmetric  about  zero  and  can  be  used 
directly  to  compute  the  significance 
level.  If  large  values  of  LRH  are 
expected  with  a  treatment  effect,  then  the 
bootstrap  significance  level  is: 

(4.8)  phb  =  (number  of  LRHB’s 
exceeding  LRH)/NB. 
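
The construction (4.6) to (4.8) can be sketched as follows. This is a minimal sketch assuming equal weights (3.2) for GH and the ulcer counts from section 5; the function names and the seed are hypothetical, and the returned level is a Monte Carlo quantity that changes with the seed.

```python
import numpy as np

# Ulcer trial counts from section 5: columns are x, n, y, m per site.
SITES = np.array([(11, 33, 24, 37), (2, 25, 5, 26), (5, 28, 7, 23),
                  (4, 16, 3, 14), (7, 17, 6, 21), (3, 23, 4, 24)])
X, N, Y, M = SITES.T


def logit(p):
    return np.log(p / (1 - p))


def mean_log_odds_ratio(x, n, y, m, gh):
    """LRH: GH-weighted average of the smoothed empirical log odds ratios."""
    ph = (x + 0.5) / (n + 1.0)
    qh = (y + 0.5) / (m + 1.0)
    return np.sum(gh * (logit(qh) - logit(ph)))


def null_distribution(ph, qh, gh):
    """Support points and masses of GHO from steps (4.6) and (4.7)."""
    shift = (np.sum(gh * qh) - np.sum(gh * ph)) / 2.0
    pht = ph + shift                      # translated points (4.6)
    qht = qh - shift
    p0 = np.concatenate([pht, qht])       # translated points, then reflections
    q0 = np.concatenate([qht, pht])
    g0 = np.concatenate([gh, gh]) / 2.0   # half of each mass on each copy
    return p0, q0, g0


def bootstrap_p_value(x, n, y, m, nb=600, seed=0):
    """One-tailed bootstrap significance level (4.8), sampling from GHO."""
    rng = np.random.default_rng(seed)
    k = len(n)
    gh = np.full(k, 1.0 / k)              # equal weights (3.2)
    ph = (x + 0.5) / (n + 1.0)
    qh = (y + 0.5) / (m + 1.0)
    lrh = mean_log_odds_ratio(x, n, y, m, gh)
    p0, q0, g0 = null_distribution(ph, qh, gh)
    exceed = 0
    for _ in range(nb):
        idx = rng.choice(2 * k, size=k, p=g0)
        xb = rng.binomial(n, p0[idx])
        yb = rng.binomial(m, q0[idx])
        if mean_log_odds_ratio(xb, n, yb, m, gh) > lrh:
            exceed += 1
    return exceed / nb
```

By construction the mean log odds ratio under GHO is zero, which is a quick check that the null hypothesis (3.1) holds for the estimated null distribution.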

Again  if  k  and  the  nj’s,  mj’s  are  large 
then  phb  will  be  close  to  the  actual 
attained significance level under the null
hypothesis.  With  small  k  this  method 
provides  an  internally  consistent 
approximation  within  the  context  of 
sampling  from  GHO.  Using  several  methods 
to  obtain  GH  as  discussed  in  section  3  it 
is  possible  to  obtain  several  values  of 
phb  to  see  if  the  overall  treatment  effect 
conclusion  is  affected  by  the  method  of 
estimation.  These  procedures  are 
illustrated  in  the  next  section. 


5.  TWO  EXAMPLES 

The  first  example  involves  a  test  drug 
for treating ulcers. The following data
was obtained after two weeks of treatment.


STUDY   PLACEBO                  DRUG                     LOG ODDS RATIO
SITE    NO. HEALED/N      %      NO. HEALED/M      %      DRUG TO PLACEBO

1          11/33        33.3        24/37        64.9          1.31
2           2/25         8.0         5/26        19.2          1.01
3           5/28        17.9         7/23        30.4          0.70
4           4/16        25.0         3/14        21.4         -0.20
5           7/17        41.2         6/21        28.6         -0.56
6           3/23        13.0         4/24        16.7          0.29

This  drug  was  clearly  effective  after 
4  weeks  of  treatment.  The  question  here 
is  whether  an  efficacy  claim  is  warranted 
after two weeks.

The  likelihood  ratio  test  for 
proportional  odds  yields  a  chi-square 
statistic  =6.09  with  5  degrees  of 
freedom,  clearly  not  significant  at  p=0.2. 
Assuming  proportional  odds,  the  likelihood 
ratio  chi-square  statistic  testing  the 
hypothesis that r=1 is 4.31 with 1 degree
of  freedom,  significant  at  p=.05.  The 
estimated  common  log  odds  ratio  is 
.58  +-  .28  with  a  95%  confidence  interval 
of  (0.03,  1.13).  This  analysis  suggests  a 
claim  for  efficacy  relative  to  placebo  can 
be  made  after  two  weeks  of  treatment. 

What  is  disquieting  about  this 
conclusion  is  that  the  last  three  study 
sites  did  not  yield  overwhelming  evidence 
for  the  drug.  The  lack  of  significance  in 
the  test  for  proportional  odds  may  be  due 
more  to  small  sample  sizes  rather  than 
homogeneity  of  odds  ratios. 

The three estimators given by (3.2),
(3.3), and (3.4) were used in the context
of a random effects model. The following
results were obtained for the expected log
odds ratio.


TWO WEEKS ON TREATMENT
MEAN LOG ODDS RATIO ESTIMATES

TYPE OF ESTIMATE       ESTIMATE +- SE     CONFIDENCE INTERVAL

Equal Wgts (3.2)       0.39 +- 0.39       (-0.29, 1.09)
n*m/(n+m)  (3.3)       0.54 +- 0.39       (-0.14, 1.26)
MLE        (3.4)       0.29 +- 0.53       (-0.45, 0.94)


The  standard  errors  were  obtained  via  the 
asymptotic  approximation,  and  the 
confidence  intervals  were  obtained  using 
the  percentile-t  bootstrap  method  given  by 
(4.5). With the exception of the MLE,
these  asymptotic  standard  errors  were  in 
agreement  with  the  corresponding  bootstrap 
standard  errors.  The  bootstrap  standard 
error  for  the  MLE  was  0.38,  somewhat  lower 
than  0.53  given  above. 

Bootstrap significance levels were
obtained using (4.8).

                 ESTIMATED P LEVEL

Equal Wgts             0.15

n*m/(n+m)              0.09

MLE                    0.27

The  confidence  intervals  and  significance 
levels  do  not  support  a  claim  for  efficacy 
because  the  estimators  GH  tend  to 
emphasize  the  variability  in  the  log  odds 
ratios.  Note  that  the  Equal  Weights 
estimator  gives  more  weight  to  the 
negative  studies  than  the  second  estimator 
which  weights  the  sites  according  to  a 
sample  size  factor.  Note  also  that  this 
second  estimate  is  closer  than  all  the 
others  to  the  proportional  odds  estimate, 
because  the  sites  are  weighted  in  a  manner 
similar to the Mantel-Haenszel method.

The random effects MLE (3.4) assigned
weights according to the following
proportions:


Most  of  the  weight  was  pulled  toward  site 
6,  where  the  results  for  the  drug  were  not 
spectacular.  Why  site  1  got  one  sixth  of 
the weight and site 6 got 50% of the
weight  is  a  subject  for  another  paper. 

In  summary,  this  first  example  seemed 
to  satisfy  the  proportional  odds 
assumption  which,  when  applied,  led  to  a 
conclusion  of  drug  efficacy  after  two 
weeks  on  treatment.  This  conclusion 
relied  heavily  on  proportional  odds 
because  all  three  random  effect  analyses 
yielded  nonsignificant  evidence  for 
efficacy. 

The  second  example,  discussed  by 
DerSimonian and Laird, is a meta-analysis
of  placebo  controlled  trials  testing  the 
effectiveness  of  cimetidine  for  healing 
ulcers  (Winship,  1978).  The  following 
data  was  taken  from  this  study. 


This  example  is  interesting  because  the 
proportional  odds  assumption  is  rejected 
by  the  data,  but  this  should  not  stand  in 
the  way  of  observing  that  cimetidine  was 
significantly  more  effective  than  placebo. 
The  chi-square  test  for  proportional  odds 
was  15.85  with  7  degrees  of  freedom, 
significant  at  the  .05  level.  Since  a 
common  log  odds  ratio  is  rejected  by  this 
data,  estimates  of  a  mean  log  odds  ratio 
are  given. 

EFFECT OF CIMETIDINE
MEAN LOG ODDS RATIO ESTIMATES

TYPE OF ESTIMATE       ESTIMATE +- SE     CONFIDENCE INTERVAL

Equal Wgts (3.2)       1.79 +- 0.30       (1.31, 2.32)

n*m/(n+m)  (3.3)       1.41 +- 0.37       (0.77, 2.13)

MLE        (3.4)       1.74 +- 0.39       (1.21, 2.06)

As  evidenced  by  the  estimates  and  the 
confidence intervals, the estimated mean
log odds ratio is sufficiently far away
from zero regardless of which method of
estimation is used. All three bootstrap
significance  levels  were  less  than  .001. 

The  estimate  (MLE)  of  the  log  odds 
ratio  under  the  erroneous  assumption  of 
proportional odds is 1.41 +- 0.17. In the
random effects model the conclusion of
cimetidine  efficacy  is  still  apparent  even 
after  paying  a  substantial  penalty  in  the 
standard  error.  Note  that  again  the 
second  estimator  (weights  prop,  to 
n*m/(n+m)  )  gives  a  value  similar  to  the 
MLE  under  proportional  odds. 

The  maximum  likelihood  estimate  (3.4) 
of  the  random  effect  distribution  was: 

STUDY  3  5  6  7  8 

MLE  .062  .225  .036  .529  .148 

Studies  1,2,  and  4  received  zero  weight. 

The  SAS  programs  used  in  this  paper  are 
available  by  request  from  the  author. 


Data for the second example (cimetidine meta-analysis):

         PLACEBO                  DRUG                     LOG ODDS RATIO
STUDY    NO. HEALED/N      %      NO. HEALED/N      %      DRUG TO PLACEBO

1           8/19         42.1        16/19        84.2          1.99
2           5/14         35.7        26/30        86.7          2.46
3          12/20         60.0        17/20        85.0          1.33
4           5/18         27.8        17/20        85.0          2.69
5           7/24         29.2        47/65        72.3          1.85
6           4/21         19.0        13/21        61.9          1.93
7          16/42         38.1        36/43        83.7          2.12
8          55/142        38.7        74/130       56.9          0.74



REFERENCES 

Efron, B. (1982) The Jackknife, the
Bootstrap, and Other Resampling Plans.
SIAM, Philadelphia.

DerSimonian, R. and Laird, N. (1986)
"Meta-analysis in clinical trials".
Controlled Clinical Trials, 7: 177-188.

Wittes, J. and Wallenstein, S. (1987) "The
power of the Mantel-Haenszel test".
Journal of the American Statistical
Association, 82, 400: 1104-1109.

Winship, D. (1978) "Cimetidine in the
treatment of duodenal ulcer".
Gastroenterology 74: 402-406.

Fleiss, J. (1981) Statistical Methods for
Rates and Proportions. John Wiley & Sons,
New York.

Dempster, A., Laird, N., Rubin, D. (1977)
"Maximum likelihood from incomplete data
via the EM algorithm". J. Royal
Statistical Society B, 39: 1-38.

Beran, R. (1987) "Prepivoting to reduce
level error of confidence sets".
Biometrika, 74, 3: 457-468.




Bootstrapping the Mixed Regression Model with Reference to
the Capital and Energy Complementarity Debate*

Baldev Raj, Wilfrid Laurier University


1. INTRODUCTION

The estimation of the partial Allen
elasticity of substitution between energy and
capital in the manufacturing process has been
the subject of a number of studies. The results
from these studies have not always been in
agreement. For example, Berndt and Wood (1975)
found that capital and energy were complements,
while Griffin and Gregory (1976) and Pindyck
(1979) found that capital and energy are
substitutes. The implication of energy and
capital complementarity is that, ceteris paribus,
higher priced energy will not only dampen its
own demand, but also the demand for new
investment in plants and equipment.

A number of avenues for reconciling these
conflicting empirical results have been
explored in the literature. For example, it has
been suggested that the use of time series versus
cross-section data leads to different results;
studies that use time series data capture
short-run factor relationships while studies
that use cross-section data measure the
long-run factor relationships. Others have argued
that there is a need to disaggregate capital
inputs into physical and working capital. The
hypothesis is that while physical capital is
complementary to energy, working capital is a
substitute for energy. Others have stressed
the need to exclude taxes from the capital
working service price. Similarly, the need to
use four inputs instead of three has been
suggested. These and other arguments are reviewed
by Solow (1987).

In this paper we examine the sensitivity
of the energy-capital complements issue by
estimating the partial Allen elasticity of
substitution between inputs i and j ($\sigma_{ij}$)
under stochastic constraints on the coefficients
of the conditional input demand (CID) functions.
The stochastic constraints are imposed
corresponding to homogeneity and symmetry
hypotheses; the estimates of the $\sigma_{ij}$'s are
obtained by using time series data covering the
period 1947-71. The data are obtained from Berndt
and Wood (1975). A novelty of this paper is the
use of the bootstrap (Efron, 1979) to estimate the
standard error of the estimate of $\sigma_{ij}$. A case
for using stochastic constraints instead of
fixed (or exact) constraints has been made by
many researchers including Tsurmi et al. (1986)
and Ilmakunnas (1986). It can be argued that
the use of exact constraints, which are a
special case of the stochastic constraints
approach, is both restrictive and unnecessary. Our
examination builds on the papers by Freedman and
Peters (1984) and Ilmakunnas (1986), who have
used methods similar to those in this paper in
a different but related context. We estimated
the $\sigma_{ij}$'s by the mixed estimation (MR) method
(Theil and Goldberger, 1961) to show that the
estimates of $\sigma_{ij}$ are sensitive to the choice
of a key parametric value in the stochastic
constraints. This parameter may be interpreted as
a coefficient of stickiness towards the
homogeneity and symmetry hypotheses. Our results
show that when the stickiness coefficient is
assigned a value higher than those in the sample
the estimate of $\sigma_{KE}$ can be positive instead
of negative. Further, its 75% confidence intervals
[$\hat\sigma_{KE} \pm t_{\alpha}\,SE(\hat\sigma_{KE})$] include a
positive value for $\sigma_{KE}$ whether the standard
error of $\hat\sigma_{KE}$ (SE) is taken from the standard
asymptotics or from the bootstrap (Efron, 1979).
This result shows that energy-capital
substitutability cannot be ruled out for this
configuration of the stickiness coefficient, which
might be interpreted to reflect higher
perceived or real uncertainty, asymmetric
information or institutional stickiness faced by
firms. The confidence intervals of $\sigma_{KE}$
continue to include positive values of the elasticity
when fatter-tailed errors are considered.
Fat-tailed errors are said to arise where
sample data include unusual events such as an oil
price shock, an oil embargo, etc. (Taylor, 1983).

The paper is organized as follows:
following this section we present the model and
describe the MR estimation technique. In Section
3 we present the results and their discussion;
this section also includes a brief review of
the bootstrap idea. Final remarks conclude the
paper.

2. THE MODEL AND MR ESTIMATION METHOD

The CID functions for the transcendental
logarithmic (Christensen et al., 1971) unit cost
function are given by:

(1) $S_i = \alpha_i + \sum_j \beta_{ij} \ln w_j + \epsilon_i$

where $S_i$ is the cost share of input i
representing labor (L), capital (K), energy (E) and
material (M) and the $\epsilon_i$ represent the error in
the ith equation. We assume that $E\epsilon_i = 0$ and
$E\epsilon_i\epsilon_j = \Sigma_{ij}$ for all i and j = L, K, E and M. The
cost minimization hypothesis imposes the
following set of exact restrictions on the parameters
of the CID: (I) $\sum_j \beta_{ij} = 0$; (II) $\beta_{ij} = \beta_{ji}$ for all
$i \neq j$; (III) $\sum_i \alpha_i = 1$ and $\sum_i \beta_{ij} = 0$.

The restrictions (I) to (III) are commonly
known as homogeneity, symmetry and additivity
constraints on the CID functions.

The additivity constraints are easily
incorporated into equations (1) by dropping one of
the share equations. We shall follow this
convention by dropping the material input equation
and writing the remaining 3 equations compactly:

(2) $y = (I \otimes X)\beta + \epsilon$

where y is a 3nx1 vector of n observations on
the input cost shares with
$y' = (S_{L1}, S_{L2}, \ldots, S_{Ln}, S_{K1}, S_{K2}, \ldots, S_{En})$,
X is an nx5 matrix of observations on the
variables on the right-hand side of equation (1)
with the t-th row
$(1, \ln w_{Lt}, \ln w_{Kt}, \ln w_{Et}, \ln w_{Mt})$,
$\beta$ is a 15x1 vector of parameters of the CID
equations with
$\beta' = (\alpha_L, \beta_{LL}, \beta_{LK}, \beta_{LE}, \beta_{LM}, \alpha_K, \beta_{KL}, \ldots, \beta_{EM})$
and $\epsilon$ is a 3nx1 vector of observations on the
errors. The vector $\epsilon$ is assumed to be
distributed with 0 mean and covariance matrix
$E\epsilon\epsilon' = \Sigma \otimes I$ where $\Sigma = ((\Sigma_{ij}))$ for
i, j = L, K, E and M. The stochastic versions of
the homogeneity and symmetry constraints (I) and
(II) can be compactly written as:

(3) $R\beta = u$

where R is a 6x15 matrix whose elements are
specified by the homogeneity and symmetry
conditions, $\beta$ is the 15x1 vector of coefficients
defined above and u is a 6x1 disturbance vector
such that $Eu = 0$ with $Euu' = \Phi$ where $\Phi$ is a
positive definite matrix. We follow Ilmakunnas
(1986) in using a convenient parameterization of
$\Phi$ such that $\Phi = \sigma_R^2 I$; the parameter $\sigma_R^2$
represents the degree of stickiness towards
homogeneity and symmetry. As $\sigma_R^2$ approaches zero the
stochastic constraints tend to become exact
constraints.

The MR estimator of $\beta$ in (2) under
stochastic constraints (3) is given by

(4) $b = (\hat\Sigma^{-1} \otimes X'X + \frac{1}{\sigma_R^2} R'R)^{-1} (\hat\Sigma^{-1} \otimes X')\, y$

where $\hat\Sigma = (Y - XB)'(Y - XB)/n$ is a consistent
estimator of $\Sigma$, Y is an nx3 matrix with t-th
row $(S_{Lt}, S_{Kt}, S_{Et})$ and B is a 5x3 matrix of
coefficients of the cost-share equations for L, K,
and E. We estimate the b's using an iterative
method until the estimated values converge (see
Berndt and Wood, 1975).


The asymptotic variance-covariance of b is
given by

(5) $V(b) = (\hat\Sigma^{-1} \otimes X'X + \frac{1}{\sigma_R^2} R'R)^{-1}$

It  is  easily  verified  that  the  covariance  matrix 
(5)  is  also  the  mean  square  or  risk  matrix  of  b 
since  the  stochastic  constraints  are  assumed  to 
hold  on  the  average  (cf.  Judge  et  al.,  1985,  pp. 
58-59) . 
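
A single pass of the mixed estimator (4) and its covariance (5) can be sketched with NumPy as below. This is an illustrative sketch, not the authors' code: it takes the error covariance matrix as given instead of iterating on it, and the layout of R (three homogeneity rows followed by three symmetry rows, with coefficients ordered equation by equation as intercept, L, K, E, M) is an assumption made for the example, as are the function names.

```python
import numpy as np


def constraint_matrix():
    """Hypothetical 6x15 R: homogeneity rows, then symmetry rows.

    Coefficient index is 5*e + c, with equation e in (L, K, E) and
    c = 0 for the intercept, 1..4 for the prices of L, K, E, M.
    """
    R = np.zeros((6, 15))
    for e in range(3):
        R[e, 5 * e + 1: 5 * e + 5] = 1.0          # sum_j beta_ij = 0
    pairs = [(0, 1), (0, 2), (1, 2)]              # (L,K), (L,E), (K,E)
    for row, (a, b) in enumerate(pairs, start=3):
        R[row, 5 * a + 1 + b] = 1.0               # beta_ab
        R[row, 5 * b + 1 + a] = -1.0              # minus beta_ba
    return R


def mixed_estimator(X, Y, R, sigma2_r, Sigma):
    """Mixed estimator b and its covariance for stacked share equations.

    X is n x 5, Y is n x 3 (shares of L, K, E), Sigma is 3 x 3 and
    sigma2_r is the stickiness parameter of the stochastic constraints.
    """
    y = Y.T.reshape(-1)                           # stack equation by equation
    s_inv = np.linalg.inv(Sigma)
    lhs = np.kron(s_inv, X.T @ X) + (R.T @ R) / sigma2_r
    rhs = np.kron(s_inv, X.T) @ y
    b = np.linalg.solve(lhs, rhs)
    cov = np.linalg.inv(lhs)                      # covariance as in (5)
    return b, cov
```

With a very small `sigma2_r` the estimate satisfies the constraints almost exactly, and with a very large value it approaches equation-by-equation least squares, which mirrors the role of the stickiness parameter described above.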

The estimates of $\sigma_{ij}$ can be obtained from
the formulas:

$\hat\sigma_{ij} = (b_{ij} + S_i S_j)/(S_i S_j)$ for $i \neq j$

$\hat\sigma_{ii} = (b_{ii} + S_i^2 - S_i)/S_i^2$

where $S_i$ and $S_j$ are average values of cost
shares for inputs i, j = L, K, E, M and the $b_{ij}$'s are
the MR estimates of $\beta_{ij}$ in the CID equations (1).
The asymptotic standard errors of $\hat\sigma_{ij}$ may be
obtained from

$SE(\hat\sigma_{ij}) = [V(b_{ij})/(S_i^2 S_j^2)]^{1/2}$
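
The elasticity formulas can be checked numerically with a short sketch. The function names and the example shares are hypothetical; the code simply evaluates the expressions for the off-diagonal and own elasticities and for the standard error of an off-diagonal elasticity.

```python
import numpy as np


def allen_elasticities(B, shares):
    """Matrix of Allen partial elasticities from coefficients and mean shares.

    B[i, j] holds b_ij and shares[i] holds the average cost share S_i.
    """
    B = np.asarray(B, dtype=float)
    s = np.asarray(shares, dtype=float)
    outer = np.outer(s, s)
    sigma = (B + outer) / outer                       # i != j formula
    own = (np.diag(B) + s ** 2 - s) / s ** 2          # i == j formula
    sigma[np.diag_indices_from(sigma)] = own
    return sigma


def allen_se(var_b, s_i, s_j):
    """Asymptotic standard error of an off-diagonal elasticity."""
    return np.sqrt(var_b / (s_i ** 2 * s_j ** 2))
```

As a sanity check, setting every b_ij to zero gives off-diagonal elasticities equal to one, the Cobb-Douglas case.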


3. THE BOOTSTRAP IDEA AND MR ESTIMATES

3.1  The  Bootstrap  Idea 

The bootstrap is a distribution-free method
of determining the accuracy of the parameters of
a model. The bootstrap theory is discussed in
detail by Efron (1979, 1982). A survey of the
bootstrap theory and applications is provided by
Efron and Tibshirani (1986).

The bootstrap standard error can be used for
calculating standard confidence intervals (CI) of
$\sigma_{ij}$ from the formula $\hat\sigma_{ij} \pm t_{\alpha}\,SE(\hat\sigma_{ij})$ where
$\hat\sigma_{ij}$ is an estimator of the parameter $\sigma_{ij}$,
$SE(\hat\sigma_{ij})$ is the bootstrap SE of $\hat\sigma_{ij}$, and $t_{\alpha}$
is the 100$\alpha$ percentile point from the
t-distribution.

The bootstrap idea in the context of the
standard regression model
$E(y_i) = \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij}$ may be
described as follows. Suppose we have n
observations on the dependent variable y and the
regressors $(x_1, x_2, \ldots, x_p)$. Further, suppose
that the regression errors $\epsilon_i$,
i = 1, 2, ..., n are from an unknown
distribution F and that the b's are least squares
estimators of the $\beta$'s. Then the bootstrap idea is
to approximate an unknown distribution G(F) of
$b_j - \beta_j$ by $G(F_e)$ where $F_e$ is the empirical
distribution of F for a given sample data set on
the dependent variable and its regressors.

Now, consider a large number of random
samples of size n with replacement, drawn from a
box containing the least squares residuals
$e_1, e_2, e_3, \ldots, e_n$. Suppose one such sample is
designated $e_1^*, e_2^*, e_3^*, \ldots, e_n^*$, effectively
yielding the "pseudo data" for y from
$y_i^* = b_0 + \sum_{j=1}^{p} b_j x_{ij} + e_i^*$ (i = 1, 2, ..., n).
This "pseudo data" along with the sample
observations on the regressors would then
constitute a set of sample values for the bootstrap
empirical distribution.

In view of the fact that the least squares
residuals e are not independent even though the
errors $\epsilon$ have this property, and that the e's
are a bit smaller than the $\epsilon$'s, the bootstrap
sampling from the e's can be downward biased.
This bias can be reduced by scaling up the e's by
a factor of [n/(n-p-1)] (see Freedman and Peters,
1984). We used a scaling factor in our bootstrap
results reported below.
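
The residual bootstrap with rescaled residuals can be sketched for an ordinary regression as follows. This is a hedged sketch, not the procedure used for the MR estimates: it assumes the scaling is applied as the square root of n/(n-p-1), so that the variance of the resampled residuals is inflated by n/(n-p-1); whether the source applies the factor to the residuals or to their variance is not legible and is an assumption here. The function name and defaults are hypothetical.

```python
import numpy as np


def residual_bootstrap_se(X, y, n_boot=400, seed=0):
    """Bootstrap standard errors of least squares coefficients.

    X must include the intercept column, so it has p + 1 columns.
    """
    rng = np.random.default_rng(seed)
    n, cols = X.shape
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ b
    # Assumed scaling: sqrt(n / (n - p - 1)), with n - cols = n - p - 1.
    scaled = resid * np.sqrt(n / (n - cols))
    draws = np.empty((n_boot, cols))
    for r in range(n_boot):
        e_star = rng.choice(scaled, size=n, replace=True)   # resample residuals
        y_star = X @ b + e_star                             # pseudo data
        draws[r] = np.linalg.lstsq(X, y_star, rcond=None)[0]
    return b, draws.std(axis=0, ddof=1)
```

With well-behaved errors the bootstrap standard errors should be close to the usual least squares standard errors; a large gap between the two is itself informative about the error distribution.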

3.2  The  Results 

The  MR  estimates  of  o  •  ^  were  obtained  for 

1 J  2 

three  values  of  the  stickiness  parameter  On  : 
o^  =  10  Oj^  =  10  °  and  Oj^  =  10 

However,  we  shall  present  the  detailed  re¬ 
sults  for  =  10'^  only  in  view  of  space  limi¬ 
tations.  Moreover,  the  hypotheses  that  energy 


and  capital  are  complements  was  found  not  to  be 
violated  when  =  10  ^  and  =  10~®.  The  MR 
estimates  corresponding  to  =  10~®  correspond 
to  the  exact  constraints  case.  The  point  and 
interval  estimates  of  o-  -'s  for  Od  =  10  ^  are 

1 J  K 

given  in  Table  1. 

The results in Table 1 show that the
estimates of $\sigma_{KE}$ and $\sigma_{EK}$ are of opposite sign; thus
energy and capital can be substitutes instead of
complements at the value of the stickiness
parameter used in Table 1. But the asymptotic
standard error of $\hat\sigma_{KE}$ in column 3 and the bootstrap
standard error in column 5, which are the
parametric and non-parametric measures of the
accuracy of the estimator $\hat\sigma_{KE}$, respectively, are
large. Therefore, it might be worthwhile to
calculate the 75% CI to determine if a positive
value of $\sigma_{KE}$ is included in the CI. The
possibility of a positive value in the CI would
suggest that the hypothesis of energy and
capital substitutability cannot be rejected at the
25% level. The 75% CI_a in column 4 and 75% CI_b
in column 6 represent the parametric confidence


intervals  with  SE(o^j)2  and  SECo^jlj^,  respect¬ 
ively.  These  75%  CIs  appear  to  include  a  posi- 


Table  1:  The  Hijced  Regression  Estimates  of  Allen  Partial  Elasticity  of  Substitution 

when  Og  =  10'^ 

(o^)  SE(2i^j)3  75%  CIg  SE(Oij)b  75%  CI^ 

1)  -1.607  0.128  [  1.694,  -1.520]  0.103  [-1.677,  -1.537] 

2)  1.174  0.547  [  0.803,  1.546]  0.476  [  0.851,  1.497] 

3)  1.427  0.828  [  0.865,  1.989]  0.464  [  1.112,  1.742] 

LE 

4)  o,„  0.488  0.132  [  0.398,  0.578]  0.088  [  0.428.  0.548] 

LM 

5)  0.770  0.424  [  0.482,  1.058]  0.336  [  0.542,  0.998] 

6)  -6.391  2.095  [-7.813,  -4.968]  2.141  [-7.844,  -4.938] 

7)  o„_  0.941  3.660  [-1.544,  3.426]  2.761  [-0.934,  2.81f] 

8)  0.322  0.524  [-0.034,  0.677]  0.439  [  0.024,  0.620] 

KM 

9)  o„,  1.009  0.363  [  0.762.  1.255]  0.308  [  0.800,  1.218] 

EL 

10)  o,,,,  -3.444  1.556  [-4.500,  -2.388]  1.617  [-4.542.  -2.346] 

11)  o,,,,  -12.260  3.356  [-14.539,  -9.981]  3.194  [- 14 . 429 .  - 10. 091] 

EE 

12)  0.491  0.431  [  0.198,  0.784]  0.394  [  0.224,  0.758] 

EM 

Notes:  ■ ;  The  point  estimate  of  the  Partial  Allen  elasticity  of  substitution 

of  factor  input  i  with  j. 

SECo^j)^:  The  standard  error  of  o^j  from  the  asymptotic  formula. 

75%  CI^;  The  75%  confidence  intervals  with  SECo  — 1^  and  tg^^  =  .679. 
SECSj^jljj:  The  standard  deviation  of  the  bootstrap  distribution. 

75%  CIjj:  The  75%  confidence  intervals  with  SE(o^j)gi  and  -  .679. 
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tive  value  of  Oj^g.  Hence,  the  hypothesis  that 
energy  and  capital  may  be  substitute  appear  not 
to  be  rejected  by  the  data. 
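The interval columns of Table 1 follow from the point estimate, the standard error and the tabulated t value of .679 given in the table notes. A minimal sketch, assuming the intervals are simply estimate ± t × SE rounded to three decimals (the function name is hypothetical), reproduces rows 1 and 7:

```python
def confidence_interval(estimate, std_error, t_value=0.679):
    """Interval of the form estimate +/- t * SE, rounded as in Table 1."""
    half_width = t_value * std_error
    return (round(estimate - half_width, 3), round(estimate + half_width, 3))

# Row 1 of Table 1, with asymptotic and bootstrap standard errors.
row1_asymptotic = confidence_interval(-1.607, 0.128)
row1_bootstrap = confidence_interval(-1.607, 0.103)

# Row 7 of Table 1: the interval straddles zero, so positive values are included.
row7_asymptotic = confidence_interval(0.941, 3.660)
row7_bootstrap = confidence_interval(0.941, 2.761)
```

Both row 7 intervals have a positive upper end, which is the basis for the statement that substitutability is not rejected.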

How does the existence of fat-tailed errors affect the CI? We
investigated this question by reestimating the $\sigma_{ij}$'s by the MR
method with fatter-tailed errors. These errors were obtained as
described below. Under the assumption that residuals from (2) are
normally distributed, we generated a fatter-tailed error by mixing two
sets of normally distributed errors such that a proportion (1-s) of
original errors with 0 mean and covariance matrix
$\Omega = \Sigma \otimes I$ were combined with a proportion s of another
set of errors with 0 mean and covariance matrix $d\Omega$, where d is a
scalar greater than 1. The matrix of errors so obtained is distributed
with 0 mean and covariance matrix $\Omega(d) = (1-s)\Omega + s(d\Omega)$,
and these errors are fatter-tailed than the original errors (see Flood
et al., 1984). We used d=4 and a mixing proportion s chosen such that
the kurtosis coefficient of the fat-tailed errors is about 1.64. The use
of fatter-tailed errors resulted in somewhat higher
SE($\hat\sigma_{ij}$)_b compared to those in Table 1. The 75% CI were
also computed and again failed to reject the substitutability of energy
and capital for

Ojj  =  10  . 
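The effect of the mixing on tail weight can be checked in closed form for a single error component. The sketch below assumes a univariate scale mixture of two zero-mean normals with variances in the ratio 1 : d and reads "kurtosis coefficient" as excess kurtosis; the mixing proportion s is left as a free input and the function name is hypothetical.

```python
def mixture_excess_kurtosis(s, d):
    """Excess kurtosis of (1 - s) * N(0, v) + s * N(0, d * v), independent of v."""
    second_moment = (1 - s) + s * d            # variance in units of v
    fourth_moment = 3 * ((1 - s) + s * d**2)   # a normal has fourth moment 3 * variance^2
    return fourth_moment / second_moment**2 - 3

no_mixing = mixture_excess_kurtosis(0.0, 4)    # plain normal errors
mixed = mixture_excess_kurtosis(0.26, 4)       # roughly 1.64 for d = 4
```

Under these assumptions, with d = 4 a proportion near 0.26 gives an excess kurtosis of about 1.64, while s = 0 recovers the normal case.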

4.  SUMMARY  REMARKS 

In this paper we examined the sensitivity of the estimates of the
partial Allen elasticities of substitution and their confidence
intervals when homogeneity and symmetry hypotheses hold stochastically
compared to exactly.

A simple parameterization of the covariance matrix for the disturbance
term in the stochastic constraints was considered. It was shown that
some a priori specifications of the stickiness parameter, which has an
interpretation as a coefficient of stickiness towards homogeneity and
symmetry, yield a positive estimate of $\sigma_{KE}$. The 75% CI of
$\sigma_{KE}$ were computed using both asymptotic SE and bootstrap SE
and they failed to exclude a positive value for $\sigma_{KE}$. Therefore
the hypothesis that energy and capital may be substitutes cannot be
rejected at the 25% level. These results might be interpreted in terms
of the capital utilization rationale given by Berndt and Wood (1979).

FOOTNOTES 

1. One reason these avenues have failed to resolve the controversy
might be that their studies use aggregate data, which makes it
difficult to capture properly the general equilibrium effects of an
energy price shock on business (see Solow, 1987).

2. Estimation of the parameters of the cost function under stochastic
constraints can be carried out in either a mixed regression or Bayesian
framework. In this paper we will focus on the MR approach.

3. The Mixed Regression model is a convenient econometric technique
for combining information from a given sample with prior non-sample
stochastic information with a view to obtaining a more efficient
estimate of regression coefficients. The MR model has proven to be
useful if judged solely by the plausibility of the results obtained
from it, although the assumptions it is based on are somewhat logically
flawed (see Zellner, 1975). This method was originally proposed by
J. Durbin in 1953 and later developed more fully by Theil and
Goldberger (1961) on heuristic grounds. However, it can also be
interpreted as a Bayes estimator and has been applied in areas of
consumer demand (e.g., see Paulus, 1975) and cost functions (e.g., see
Ilmakunnas, 1986 and references therein).

4. Efron (1982) has provided some evidence for the relative
performance of the jackknife and bootstrap methods. He found that while
both jackknife and bootstrap standard errors provide an almost unbiased
estimate of the parameters, the bootstrap method has a lower
coefficient of variation than the jackknife method.
Dimensionality Constraints on Projection and Section Views of High Dimensional Loci


George  W.  Furnas 
Abstract

Fundamental limitations are presented for two general graphical techniques for constructing geometric
views of high-dimensional loci, projection and section. Projections can only easily display aspects of
structure that are of low dimensionality. Sections, i.e., intersections of affine subspaces with a locus,
can easily display structure of only low co-dimensionality (and hence high dimensionality). However,
compositions of section and projection can display aspects of structure of any intermediate
dimensionality. These assertions are proven for the fundamental idealization of loci that are arbitrary
affine subspaces of a high-dimensional space. The issues introduced by finite extent, by curvature, by
quantization and by error noise are then discussed, basically in terms of notions of scale. Examples of
using the composition technique are given, examining the structure of two high-dimensional objects
embedded in a six-dimensional space.


1. Introduction

The investigation of high dimensional loci arises in both
mathematics and statistics. In mathematics, sets of equations
and inequalities, or computational procedures, can define
mathematical objects of high dimensionality. Graphics
provides one set of tools to augment algebraic attempts to
understand the structure of these objects. The typical
graphical approach is to make various 2-dimensional
projections and sections of the locus, from which some sense
of its structure is obtained. In statistics, multivariate
data form high dimensional point-clouds whose structure
must be detected and modeled. Again graphics are playing
an increasing role in augmenting parametric characterization
of the structure of such loci, particularly in the exploratory
stages of data analysis. Though statisticians sometimes use
various glyph variation schemes¹ for graphical presentation
of multivariate data (e.g., Chernoff Faces, trees and
"castles"), geometric transformations, usually projection,
are also used to produce two-dimensional renditions of high
dimensional loci. This paper represents a basic attempt to
understand the theoretical power of views generated by such
geometric transformations.

1.1 A Motivating Example: The 4-point Ultrametric Locus

The inherent limitations of low-dimensional projections
can be illustrated by the 4-point Ultrametric Locus, a
particular mathematically defined locus embedded in 6-space.
(This locus arises in efforts to understand the families
of distance matrices satisfying different metrics, e.g., general
metric, euclidean, ultrametric. Its importance here is
simply that it is an interesting locus in six-space.)

1. Such schemes use some "glyph", such as an iconic face, whose various
graphical features (e.g., aspect ratio of the face, size of eyes, etc.)
are parameterized and associated with variables. Thus a set of points
becomes a family of glyphs.

Consider, for four points A, B, C and D, the vector of six
pair-wise distances between them:

$(u, v, w, x, y, z) = (d_{AB}, d_{AC}, d_{BC}, d_{AD}, d_{BD}, d_{CD}).$

Of all possible (non-negative) sextuples of such
distances, consider only those that correspond to distances
satisfying the Ultrametric Inequality:

$d_{ij} \le \max(d_{ik}, d_{jk}), \quad i, j, k \in \{A, B, C, D\}.$

Ultrametric distances are interesting because there is a
one-to-one correspondence between such sextuples of
distances and hierarchical clusterings of objects A, B, C and
D, or equivalently rooted ultrametric trees with these four
objects as their leaves. Thus understanding this locus
amounts to understanding the complete set of Ultrametric
trees on 4-points.

The  set  of  ultrametric  sextuples  forms  a  locus  (UM- 
Locus)  of  some  type  embedded  in  the  six  dimensional  space 
of  all  sextuples.  For  various  algebraic  reasons,  this  locus 
was  known  to  have  interesting  structure.  To  get  a  better 
sense  of  it  in  detail,  one  might  try  to  "look"  at  it  using  a 
powerful  high-dimensional  rotation  and  projection  system, 
such  as  The  Data  Viewer,  developed  by  Andreas  Buja  and 
his colleagues for looking at high dimensional
multivariate  point-clouds.  To  do  this,  a  point-cloud 
representation  of  the  locus  was  created  by  generating  and 
testing  each  point  in  the  six-dimensional  unit  hypercube 
whose  coordinates  were  multiples  of  0.10.  Points  on  this 
grid  that  satisfied  the  Ultrametric  Inequality  were  collected, 
and  the  rest  ignored.  The  resulting  six-dimensional  point- 
cloud  was  then  entered  into  The  Data  Viewer  which  then 
dynamically  rotated  the  locus  and  generated  a  continuous 
moving  sequence  of  two-dimensional  projections.  One  such 
projection  is  shown  in  Figure  1. 
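The generate-and-test construction of the point-cloud can be sketched as follows. This is an illustrative reconstruction, not the original program: the function names are hypothetical, coordinates follow the order (d_AB, d_AC, d_BC, d_AD, d_BD, d_CD), and the example call uses a coarse grid so that it runs quickly (steps=10 would give the multiples of 0.10 described above, at 11^6 candidate points).

```python
from itertools import combinations, product

POINTS = "ABCD"
# Coordinate order of the sextuple (u, v, w, x, y, z).
PAIRS = [("A", "B"), ("A", "C"), ("B", "C"), ("A", "D"), ("B", "D"), ("C", "D")]

def is_ultrametric(dist):
    """Check d_ij <= max(d_ik, d_jk) for every triple; dist maps frozenset pairs to distances."""
    for i, j, k in combinations(POINTS, 3):
        a = dist[frozenset((i, j))]
        b = dist[frozenset((i, k))]
        c = dist[frozenset((j, k))]
        # Each side of the triangle may not exceed the larger of the other two.
        if a > max(b, c) or b > max(a, c) or c > max(a, b):
            return False
    return True

def ultrametric_grid(steps):
    """Grid points of the unit 6-cube (coordinates k/steps) that satisfy the inequality."""
    levels = [k / steps for k in range(steps + 1)]
    cloud = []
    for values in product(levels, repeat=len(PAIRS)):
        dist = {frozenset(pair): v for pair, v in zip(PAIRS, values)}
        if is_ultrametric(dist):
            cloud.append(values)
    return cloud

cloud = ultrametric_grid(steps=2)  # coarse grid: coordinates in {0, 0.5, 1}
```

The resulting list of 6-tuples is the kind of point-cloud that could then be handed to a rotation and projection viewer.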
Figure 1. 2-Dimensional projection of the 6-dimensional "4-Point Ultrametric Locus"


The critical feature of Figure 1 is that it shows essentially
nothing interesting! The only visible aspects of the
structure are artifactual; e.g., the edges and corners seen in
the figure are the edges and corners of the hypercube that the
locus was sampled from and not special features intrinsic to
the structure itself.

Here is a graphical tool used frequently by statisticians,
sometimes with marked success, to look at high dimensional
loci. Yet for this locus, which is known to have interesting
structure, projection shows nothing. The work presented in
this paper represents an attempt to understand what is
happening in Figure 1 by addressing a simple, though
fundamental, idealized case.

1.2  The  Two  Geometric  Viewing  Techniques 

We will actually investigate two common geometric
techniques for deriving a low dimensional picture of a higher
dimensional locus: projection and section.

By a k-projection we will mean an orthogonal
projection of a high dimensional locus embedded in n-space
onto a k-dimensional affine subspace (e.g., onto a line,
plane, or general k-dimensional hyperplane, not necessarily
through the origin. Orthogonality is with regard to the
canonical inner product on $\mathbb{R}^n$.) Most typically, this means
projecting from n-space onto some 2-dimensional plane at
some orientation in the n-space. This 2-projection is used as
a 2D graphic, i.e., a picture on paper or in a video display.

By a k-section we will mean the intersection of a
k-dimensional affine subspace with the high dimensional locus
residing in n-space. A 2-section arising from intersecting
some plane in the n-space with the locus can be presented as
a sort of cross-sectional picture of the locus.

Although for simple graphics k=2, interest in the general
case of k>2 is not just theoretical. There are ways to present
graphics that are more than 2-dimensional, e.g., using stereo
presentation, color and motion/time (e.g., [16]). The results
that follow should pertain to these higher-dimensional
graphics as well. Also, as will be seen, it will be useful to
consider the composition of section and projection operations
of various dimensionalities, and their net effect can only be
understood by considering the general case.

2.  The  Affine  Subspace  Idealization 

Imagine that a demon opponent presents an investigator
with an n-dimensional black box that has a target object embedded
in it, and challenges the investigator to use
geometric/graphical techniques to discover what is inside.
The demon's goal is to put in something hard; the
investigator's, to figure out what is there anyway. This fully
general problem is exceedingly difficult, so we consider here
a fundamental simple case: Suppose we allow the demon to
put only certain very simple high-dimensional loci in the
box: flats. By a flat, we will mean an arbitrary affine
subspace: a point, line, plane, or hyperplane (not
necessarily through the origin). In particular, an m-flat will
mean an m-dimensional affine subspace embedded in n-space.

Note that these special loci differ from loci of practical
interest in several ways. They are infinite in extent and have high
translational symmetry (a line looks the same everywhere
along its length). In addition, unlike statistical loci (and
some mathematical ones in theory, and many in
computational practice), they are continuous. In this
difference resides the idealization, and we will try to return
across this gap at the end. In any case, flats are sufficiently
primitive and fundamental objects that understanding their
behavior has value in its own right.

Accepting for the while this restriction, the situation is
thus: the demon will put some target m-flat in the n-space,
and the investigator will try to use k-projection or k-section
to look at what is there. What will the investigator see?

2.1 Constraints on Projection Views

Consider first the case of projection, i.e., "How does an
m-flat appear in a k-projection?" The answer turns out to be
quite simple:²

The operation of k-projection will yield an image of
the m-flat that almost surely

• preserves dimensionality of an m-flat (i.e., m-flat
in n-space ==> m-flat in k-space), when m < k,
and

• is of full k dimensionality when m ≥ k (thus
indistinguishably covering the k-dimensional
viewing space).

Thus for example, a point (0-flat) in 3-space always
appears as a point (0-flat) in a 2-projection. Thinking of the
casting of a shadow as a projection, recall that the shadow of
a point is a point, regardless of its position in 3-space. It is
likewise true in n-space. Similarly, a line (1-flat) in 3-space
will almost surely appear as a line in a 2-projection; the
shadow of a line is almost surely a line. The italicized
phrase, almost surely, is being used in the technical
(measure theoretic) sense.³ That is, for example, it is
possible for a line (1-flat) to 2-project not into a line (1-flat)
but into a point (0-flat). However, this can happen only in the
singular case that the line is perpendicular to the 2-flat used for
the projection (the viewing space). This singular case has
measure zero (i.e., zero probability if flats are chosen
randomly), and hence almost surely the 2-projection of a
1-flat is a 1-flat.
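The projection rule can be checked numerically on randomly oriented flats. The sketch below is an illustration under stated assumptions: flats and viewing spaces are drawn with Gaussian coordinates, the viewing axes are not orthonormalized (which changes the image coordinates but not its dimension), translation is ignored because it does not affect dimension, and all function names are hypothetical.

```python
import random

def matrix_rank(rows, tol=1e-9):
    """Rank of a small dense matrix by Gaussian elimination with partial pivoting."""
    work = [list(r) for r in rows]
    if not work:
        return 0
    n_rows, n_cols = len(work), len(work[0])
    rank = 0
    for col in range(n_cols):
        if rank == n_rows:
            break
        pivot = max(range(rank, n_rows), key=lambda r: abs(work[r][col]))
        if abs(work[pivot][col]) < tol:
            continue  # no usable pivot in this column
        work[rank], work[pivot] = work[pivot], work[rank]
        for r in range(rank + 1, n_rows):
            factor = work[r][col] / work[rank][col]
            work[r] = [a - factor * b for a, b in zip(work[r], work[rank])]
        rank += 1
    return rank

def image_dimension(directions, view_axes):
    """Dimension of the projected image of the flat spanned by `directions`."""
    image = [[sum(a * d for a, d in zip(axis, direction)) for axis in view_axes]
             for direction in directions]
    return matrix_rank(image)

def random_vectors(count, n, rng):
    return [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(count)]

rng = random.Random(1)
# Randomly oriented m-flats in 6-space viewed through random k-projections.
generic = {(m, k): image_dimension(random_vectors(m, 6, rng), random_vectors(k, 6, rng))
           for m in range(0, 5) for k in (2, 3)}
# Singular case: a line perpendicular to the viewing plane projects to a point.
singular = image_dimension([[0.0, 0.0, 1.0]], [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
```

For randomly drawn orientations the image dimension should come out as min(m, k), while the hand-built perpendicular line collapses to dimension 0.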

2. Proofs of the almost surely assertions about dimensionality of
projections and sections will be published elsewhere, and are also
available in

3. The almost surely statements here require only that underlying
probability distributions be absolutely continuous w.r.t. the Lebesgue
measure on the corresponding natural euclidean parameter spaces.
For example, coordinates of the n×(n-p) matrix defining a
p-dimensional linear subspace could be sampled from the standard
spherical multivariate normal on $\mathbb{R}^{n(n-p)}$. See the proofs for details.


The projection operation cannot preserve dimensionality
of a target m-flat if m gets so large that it exceeds the
dimensionality of the viewing space. Illustrating this case
where m ≥ k, note that a plane (2-flat) in 3-space will
almost surely 2-project onto the whole projection plane. The
whole 3-space (3-flat) will also 2-project to cover the whole
plane. The 2-projection alone cannot distinguish a 2-flat
target from a 3-flat one.

Thus if the demon puts a point or a line in the box, the
investigator can easily disclose it with an arbitrary
2-projection, and thereby win. However if the demon sets as a
target a higher dimensional m-flat, all 2-projections will be
completely and indistinguishably covered.

A second look at the Ultrametric Locus of Figure 1 bears
out the results just given. The projection made visible only
0-dimensional features (point-like corners) and
1-dimensional features (line-like edges) of the locus.
Unfortunately these were artifactual aspects of the locus.
The interesting structure apparently was in the higher
dimensionality, and to the demon's gratification, was
self-obscured by the projection operation.

This means that projection is a powerful technique for
identifying low dimensional affine substructures in high
dimensional space, but almost surely useless in finding
higher dimensional ones. Put another way, if the affine
structure of interest is of low dimensionality, essentially
ANY projection will show it clearly. If it is of high
dimensionality (where "high" is often only m ≥ 2, since
typical projections are 2D), only very singular projections
will show it. It is the struggle against this almost surely
condition that makes the pursuit of informative projections
(e.g., in Projection Pursuit) so difficult.

2.2  Constraints  on  Section  Views 

Fortunately for the investigator, the second tool available
for creating low-dimensional views, section, has a
complementary power. Considering the case of section, we
ask, "How does an m-flat appear in a k-section?"

The answer to this requires the notion of co-dimension.
The co-dimension of a flat is the complement of its
dimensionality with respect to that of the full space. That is,
in n-space, the co-dimensionality of an m-flat is defined to be
i=n-m. Thus the co-dimensionality of a plane in 3-space is
(3-2)=1, that of a point in a plane is (2-0)=2.

Whereas the effect of projection was put simply in terms of
dimension, the effect of section is put simply in terms of
co-dimension:

The operation of k-section will yield an image of an
m-flat that almost surely

• preserves the co-dimensionality of an m-flat (i.e.,
(n-i)-flat in n-space ==> (k-i)-flat in k-space),
when (n-m) ≤ k, and

• is empty (i.e., indiscriminately missing m-flats)
when (n-m) > k.

Let i be the co-dimension of the m-flat, i.e., m=(n-i).
If i=(n-m)≤k, then the (n-i)-flat almost surely appears as
a (k-i)-flat in the k-section. Thus for example, in 3-space a
line [=(3-2)-flat] almost surely appears as a point
[=(2-2)-flat] in an arbitrary 2-section. A plane [=(3-1)-flat]
almost surely appears as a line [=(2-1)-flat] in an arbitrary
2-section.

On the other hand, if i=(n-m)>k, it almost surely
disappears from the k-section. Thus in three space an
arbitrary 2-section will almost surely miss a target point
[=(3-3)-flat]. It will reveal the point only under the singular
condition that the viewing plane happens to be positioned
and oriented so as to pass through the point.

Thus if the demon puts a flat of co-dimension 0, 1, or 2
(dimensionality n, n-1 or n-2) in the box, the investigator can
easily disclose it with an arbitrary 2-section. But if the
demon sets a target of higher co-dimension, all the
investigator's 2-sections will almost surely miss. So,
whereas projection is a powerful technique for identifying
structure of low dimension, section is useful for finding
structure of low co-dimension.
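The section rule is simple enough to state as a small function. The sketch below only encodes the almost-sure rule as described in this section, with equality of co-dimension and k counted as visible (as in the 3-space examples); the function name is hypothetical.

```python
def section_image_dim(n, m, k):
    """Almost-sure dimension of a k-section of an m-flat in n-space, or None if it is missed."""
    codim = n - m
    if codim > k:
        return None       # the section almost surely misses the flat
    return k - codim      # co-dimension is preserved inside the k-section

# The 3-space examples from the text, seen through an arbitrary 2-section.
line_in_3space = section_image_dim(3, 1, 2)    # a point (0-flat)
plane_in_3space = section_image_dim(3, 2, 2)   # a line (1-flat)
point_in_3space = section_image_dim(3, 0, 2)   # missed

# Flats in 6-space that a 2-section can reveal.
visible_in_6space = [m for m in range(7) if section_image_dim(6, m, 2) is not None]
```

For 6-space and k=2 only flats of dimensionality 4, 5 and 6 are visible, which matches the statement that sections find structure of low co-dimension.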

2.3 Complementarity and Composition

These previous properties of projection and section are
summarized in Figure 2 for flats in 6-space, the
dimensionality of the black box containing the Ultrametric
Locus of Figure 1. If the affine structure of interest is of low
dimensionality, essentially ANY projection will show it
clearly. If it is of low co-dimensionality, essentially ANY
section will show it. Thus these two patterns of strengths are
complementary. However, even the union of the two
techniques is still limited. Given that one seeks only
two-dimensional pictures, so that k=2, projection can find
substructures of dimensionality 0 and 1, and section can find
dimensionality n, n-1, and n-2. In cases where n ≤ 4, this
covers all the cases. But for larger n, there is a gap between
the low dimensional and low co-dimensional extremes, and
so the demon can still win.


Figure 2. The joint capabilities of section and projection. (Panels: 6-Space; 2-D Image.)
Fortunately, composition of these techniques can
completely bridge the gap. For example, consider the
problem of finding a 3-flat in 6-space, using a k=2
dimensional viewing space. Neither single approach will
find it: Since m=3 > k=2, it will almost surely
indiscriminately cover 2-projections, and since
n-m = 6-3 = 3 > k=2, it will almost surely not appear in
2-sections. How can an informative 2D view be created?

Following Figure 3, note that a 4-section of the 6-space
will almost surely contain an image of the 3-flat, since
n-m = 6-3 = 3 < k=4. Since section preserves co-dimension,
the (6-3)-flat will become a (4-3)-flat [=1-flat] in the
4-section. Thus we have a section that at least contains some
image of the target. The problem is that a 4-section is not a
2D picture. That is easily solved by taking a 2-projection of
the 4-section. The 4-section is now a new 4-dimensional
black box with a 1-flat target in it. Correspondingly let
n'=4, m'=1 and k'=2. Thus, since m'=1 < k'=2,
dimensionality will be preserved by the projection, yielding
a clearly visible 1-flat (line) in the final image. That is, if the
investigator takes a 2-dimensional projection of a
4-dimensional section of a 6-dimensional space, and sees a
line, she has just found a 3-flat.


Figure 3. Effects of a 4-section followed by a 2-projection, and the remaining gap.
(Panels: 6-Space; 4-Section; 2-Projection.)


By similar combinations, the investigator can reveal an
arbitrary m-flat. E.g., it will almost surely appear as a line
in a 2-projection of an (n-m+1)-section. Equivalently, the
m-flat target could be revealed by an alternative
composition, taking a 2-section of an (m+1)-projection. In
either case the investigator can now always beat the demon.

It should be stressed that when used as suggested by the
constraints discussed here, m-flat structure can be found
WITHOUT SEARCH through the orientation and location
parameters of the section and projection operations. The
"almost surely" considerations mean that sections and
projections of arbitrary positions and orientations should
yield the desired result. One must only examine at most n/k
k-dimensional views. Each corresponds to a different
dimensionality of the initial j-section
(j = n, n-k, n-2k, n-3k, ..., k), which precedes the final
k-projection in the composite strategy.
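The composite strategy can be traced with the two dimension rules alone. The sketch below is an illustration of that bookkeeping, assuming n is a multiple of k as the list of j values suggests; the function names are hypothetical.

```python
def composite_view_dim(n, m, j, k):
    """Almost-sure image dimension of an m-flat in a k-projection of a j-section of n-space.

    Returns None when the j-section almost surely misses the flat.
    """
    codim = n - m
    if codim > j:
        return None
    section_dim = j - codim        # section preserves co-dimension
    return min(section_dim, k)     # projection preserves dimension below k

def informative_sections(n, m, k):
    """Section sizes j = n, n-k, ... whose k-projection shows the flat without covering the view."""
    found = []
    j = n
    while j >= k:
        dim = composite_view_dim(n, m, j, k)
        if dim is not None and dim < k:
            found.append((j, dim))
        j -= k
    return found

# The worked example: a 3-flat in 6-space with a 2-dimensional viewing space.
plain_projection = composite_view_dim(6, 3, 6, 2)   # j = n means no sectioning: view covered
four_section = composite_view_dim(6, 3, 4, 2)       # a visible line
two_section = composite_view_dim(6, 3, 2, 2)        # missed entirely
plan = informative_sections(6, 3, 2)
```

For n=6 and k=2 every m-flat with m < 6 has exactly one informative section size among j = 6, 4, 2, so three views suffice.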
3. Two Examples

The previous theory is based on the idealization that the
demon can only use flats as targets, yet real loci can deviate
from this idealization in many ways. Before considering
such deviations in detail, we will first present some examples
to show that the technique holds promise even for real loci.

Views of the loci in this section were again generated
using The Data Viewer to do 2-D projections and
systematically using its "brushing" facility compositely to do
sections. (Note that a p-dimensional brush, i.e.,
conditioning on a linear combination of p variables, leaves
n-p free to vary, creating an (n-p)-dimensional section.)

3.1 Example 1: The 4-point Ultrametric Locus revisited

In this example the composition of section and projection
is used to get a more informative display of the Ultrametric
Locus. An arbitrarily oriented 2-projection of the full locus
was presented in Figure 1.


Figure 4. A 2-projection of a 4-section of the
Ultrametric Locus.

Figure 4 presents a 2-projection of a 4-section through
the locus (the same 2-projection as in Figure 1, so Figure 4 is
actually embedded in Figure 1). The result is a 5-segment
tree structure. Essentially all such sections have this
structure (Figure 5 shows another completely different
section and projection). These trees are made up of 1-flat
pieces.


Figure 5. Another 2-projection of a 4-section of the
Ultrametric Locus.

Working backwards through Figure 3, we can see that a
1-flat image in a 2-projection of a 4-section corresponds to a
3-flat in the embedding 6-space. That is, the locus is an
articulated tree-like collection of 3-flat pieces. Further
investigations can show that there are three distinct though
connected sets of these 5-segment images.

All of these results are consistent with what is known
about the set of ultrametric distances on four points. It has
been mentioned that such distances correspond to distances
in rooted binary trees on four points. There are 15 different
such binary tree topologies, one associated with each of the
segments of the three 5-segment shapes in the figures. Each
of the 15 tree topologies has three continuous parameters
that affect distance: the distance matrix is altered in a
continuous fashion by changing the heights of the three
internal nodes of the rooted binary tree. This explains the
local three-dimensionality of the locus as revealed in the
line-like appearance in the 2-projections of the 4-sections of
Figures 4 and 5. The composition of section and projection
yields a powerful look at this articulated high dimensional
object, even though it is not simply a flat.

3.2 Example 2: A three dimensional torus in 6 dimensional space

The second example examines a curved object, a three
dimensional torus in 6 dimensional space. Such a torus is
simply the Cartesian product of three circles, i.e., the set of
sextuples (u, v, w, x, y, z) such that

$u^2 + v^2 = 1$
$w^2 + x^2 = 1$
$y^2 + z^2 = 1.$


Note that these three equations define a 3-manifold
embedded in 6 space. This continuous object was turned into
a point-cloud by taking 10 points around each circle. The
Cartesian product thus yielded 1000 points. Figure 6 shows
four simple 2-projections of the resulting toroidal cloud in 6
dimensional space. Note that beyond a general curved
convex appearance, the special character of the structure is
obscured in these simple projections.
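A small sketch shows how such a cloud can be built and how a brushed section followed by a projection exposes a circle. It is illustrative only: the function names are hypothetical, and the two brushing conditions and the viewing axes are axis-aligned for simplicity rather than arbitrarily oriented as in the figures.

```python
import math

def torus_cloud(points_per_circle=10):
    """Cartesian product of three sampled unit circles: sextuples (u, v, w, x, y, z)."""
    angles = [2 * math.pi * i / points_per_circle for i in range(points_per_circle)]
    circle = [(math.cos(a), math.sin(a)) for a in angles]
    return [c1 + c2 + c3 for c1 in circle for c2 in circle for c3 in circle]

def dot(weights, point):
    return sum(w * x for w, x in zip(weights, point))

def brush(cloud, conditions, tol=1e-9):
    """Keep points whose linear combinations lie within tol of their targets (a section)."""
    return [p for p in cloud
            if all(abs(dot(weights, p) - target) <= tol for weights, target in conditions)]

def project(cloud, axis_a, axis_b):
    """2-projection of each point onto the two given viewing axes."""
    return [(dot(axis_a, p), dot(axis_b, p)) for p in cloud]

cloud = torus_cloud()
# Two conditions (u = 1 and w = 1) leave four coordinates free: a 4-section.
conditions = [((1, 0, 0, 0, 0, 0), 1.0), ((0, 0, 1, 0, 0, 0), 1.0)]
section = brush(cloud, conditions)
view = project(section, (0, 0, 0, 0, 1, 0), (0, 0, 0, 0, 0, 1))
```

With these choices the section keeps the ten points of the third circle, and the projected view lies on the unit circle in the (y, z) plane.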
Figure 7. Four 2-projections of 4-sections of the 3-torus in 6-space.




Figure 7 shows four corresponding 2-projections-of-4-sections
of the torus in 6 dimensional space. The
fundamental circular structure is clearly visible and,
referring back to the diagram of Figure 3, the appearance of
the locus as curves in the viewing plane evidences its local
3-dimensionality.

4.  Deviations  ft-om  the  Idealizatioa 

The  previous  two  examples  illustrate  how  the  earlier 
ideal  theory  seems  to  extend  to  less  ideal  cases:  both  the 
limitations  of  sections  and  projections  and  the  possible 
power of their composition are manifest. In both of these
examples,  as  in  many  real  situations,  the  high  dimensional 
loci  of  interest  differ  in  many  ways  from  the  idealized  case 
of  affine  subspaces.  In  this  section  several  of  these 
deviations  are  discussed,  with  the  principal  conclusion  that 
once  a  suitable  level  of  scale  is  determined,  ideal  results 
remain useful, thanks to the robustness of linear
approximations. The treatment here is casual and
conjectural; others' efforts at formalization would be
welcome.

4.1  Finite  Extent 

Many mathematical loci and (presumably) all empirical
statistical ones are bounded in extent. To begin to
understand the implications of boundedness, consider first
the simple case of bounded pieces of m-flats. Since real
viewing windows (paper or CRT screens) are also bounded,
the relative scale of the target and viewing bounds is
important. In particular, on a sufficiently large scale (i.e., the
target object is much smaller than the window), an m-flat-piece
becomes point-like and projection will show it. On a
sufficiently small scale in its neighborhood, the m-flat-piece
becomes like an m-flat. Then the previous techniques of
section and projection should work as described. Thus the
viewing process requires an additional tool which can
rescale the object with respect to the window size.
Projection is first used at large scale to locate the object as a
point-like entity. Then scale is reduced while staying
centered on the object until the object looms large with
respect to the window bounds, whereupon section and
projection can be used.

4.2 Non-Linear Loci

4.2.1 Manifolds

Like the second example above, many interesting loci
are manifolds that are curved, not flat. Technically, however,
any manifold appears increasingly flat when viewed more
and more locally. Thus the earlier results should hold with
respect to the image of all local regions under section and
projection. For example, the image of a 2-manifold in
3-space under 2-section should almost surely be locally
1-dimensional. But something which is locally 1-dimensional
is a 1-manifold. Thus the results should generalize to
manifolds, with a further caveat: the almost surely
condition here means that there could occasionally be
local alterations in dimensionality: singularities can be
introduced.

For a simple example, consider how best to tell a hollow
sphere from a solid one. One is a 2-manifold in 3-space, the
other a piece of a 3-flat in 3-space. Since the distinction
between these is not one of low dimensionality, they will
look the same in projection. They differ with respect to low
co-dimensionality, so only section can distinguish them.
One will appear as a ring (1-manifold), the other as a disc
(piece of a 2-flat). One could by similar means distinguish
hollow and solid hyper-spheres. Topographic maps are
another interesting example of the implicit application of
this theory. The surface of a piece of terrain is essentially a
2-dimensional manifold in 3-space. Its structure is of
co-dimension 1 and cannot be conveyed in a 2-D map by
projection. Instead topographic contours, i.e., a family of
2-sections, display its shape as curves (structures of
co-dimension 1) in the image plane.
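
The hollow-versus-solid sphere comparison can be tried numerically. The following is a minimal sketch, not a procedure from this paper: it assumes sampled point clouds, so the section is taken as a thin slab around the plane z = 0 (a sampled cloud would almost surely be missed by an exact plane), and the sample size, slab half-width and function names are all illustrative choices.

```python
import math
import random

def sample_ball(n, hollow, rng):
    """Sample n points on the unit sphere (hollow) or in the unit ball (solid)."""
    pts = []
    while len(pts) < n:
        p = [rng.uniform(-1.0, 1.0) for _ in range(3)]
        r = math.sqrt(sum(c * c for c in p))
        if r == 0.0 or r > 1.0:
            continue  # rejection sampling: keep only points in the unit ball
        if hollow:
            p = [c / r for c in p]  # push the point out to the surface
        pts.append(tuple(p))
    return pts

def slab_section_z(points, z0=0.0, delta=0.05):
    """Points within delta of the plane z = z0, dropped onto that plane."""
    return [(x, y) for x, y, z in points if abs(z - z0) <= delta]

def radii(points_2d):
    """Distance of each 2-D image point from the origin."""
    return [math.hypot(x, y) for x, y in points_2d]

rng = random.Random(0)
hollow = sample_ball(20000, True, rng)
solid = sample_ball(20000, False, rng)

proj_hollow = radii([(x, y) for x, y, _ in hollow])  # projection fills the disc
proj_solid = radii([(x, y) for x, y, _ in solid])    # projection fills the disc
sect_hollow = radii(slab_section_z(hollow))          # section: a thin ring
sect_solid = radii(slab_section_z(solid))            # section: a filled disc
```

In the two projections the image points reach well into the interior of the unit disc for both objects, whereas in the sections the hollow sphere's image stays near radius 1 and the solid one's does not.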

4.2.2 Hyper-surfaces with singularities

The investigation of general m-surfaces, i.e., surfaces
that may have singularities, is more problematic. First note,
though, that singularities are structures of lower
dimensionality, p, where p < m. If one can partition the
structure a priori into singularity-substructures by
dimensionality, then each dimensionality can be examined
according to the preceding treatment of manifolds. If no such
partition is available a priori, the situation is more difficult.
The problem is that structures of different dimensionality are
present at the same time and can obscure each other.
Singularities of dimension p = m-1 can be seen along with
the m-surface by section and projection. They will appear as
0-dimensional singularities on a 1-manifold, e.g., like a
cusp-point on a bifurcating curve. However, if p < m-1, the
singularities will either be lost by k-sections, if k is small
enough to clearly reveal the m-structure, or obscured by
the over-projecting m-structure, if k is large enough not to
miss the p-structure. A simple example of losing the
singularities is the inability to see the exact location of
mountain peaks (point singularities, of dimension p = 0 = m-2
and hence co-dimension 3) in
a topographic map. The 2-sections generating the contours
almost surely miss the exact peak location, hence the need
for a special map symbol to mark them.

4.2.3 Intersections and Unions

Some  objects  are  defined  by  the  intersections  and  unions 
of  simpler  loci.  Convex  polytopes,  for  example,  are  defined 
by  bounding  linear  pieces.  We  simply  note  that  the  resulting 
boundaries  and  joints  may  be  thought  of  as  singularities  and 
understood  as  in  the  previous  subsection.  Note  that  usually 
these  singularities  are  cleanly  nested  by  dimensionality,  and 
may possibly be teased apart into a partition a priori.
Despite  the  problems  of  seeing  all  levels  of  singularities,  the 
first  example  in  the  previous  Examples  section  shows  the 
usefulness  of  applying  section  and  projection  to  a  structure 
made  up  of  quite  a  few  jointed  flat  pieces. 

4.3 Quantization

Many objects of interest are not piecewise continuous,
but are made up of collections of isolated points. This is the
typical case in statistics, where empirical multivariate
distributions are made up of a set of observed data points. It
is also typically true for computer renditions of continuous
mathematical objects: the object is approximated by a
quantized sample. The point-cloud composition of such loci
is no problem for projection, since it will preserve the image
of points. (The projection of a point-cloud is still a
point-cloud.) It is a problem for section, however, since a random
section will almost surely miss all points in a finitely dense
sample. There are several possible solutions that may work
in various circumstances.

The first approach is to select sections carefully, so as to
ensure that they go through points. This might be possible,
for example, if the locus is generated by sampling in a
regular  grid.  This  was  the  solution  used  for  the  example  loci 
of  the  previous  section.  Caution  is  needed,  however,  since 
some  such  convenient  sections  may  be  singular  (i.e.,  w.r.t. 
the  almost  surely  conditions).  Also  new  aliasing  artifacts 
may  be  introduced,  since  the  section  operation  will  amount 
to  a  yet-more-sparsely  sampled  version  of  the  true  section 
image. 

A  second  solution  available  in  some  mathematical  cases, 
is  to  generate  the  section  loci  explicitly.  That  is,  instead  of 
generating  a  sampled  version  of  the  full  object  and  trying  to 
section it, it may be possible to first specify the parameters
of a sectioning hyperplane, and then explicitly generate a
version  of  the  object  exactly  as  it  intersects  that  hyperplane. 

A  third  solution  is  to  try  to  "smooth"  the  locus  in  some 
sense,  that  is  by  interpolating  between  points  in  some  local 
region  to  make  a  continuous  approximation  which  can  be 
treated  directly. 

A  fourth  solution  is  to  make  "fat"  points,  i.e.  make  the 
points  in  the  cloud  spheres  of  finite  radius,  so  there  is  a  finite 
chance  of  hitting  them  with  a  section. 

A  final  solution  involves  taking  "thick"  sections,  i.e., 
ones  with  finite  volume  so  that  they  can  intersect  some  of  the 
points.  A  "thick"  section  would  capture  all  points  in  the 
locus within some distance, δ, of the sectioning hyperplane.
This would be accomplished by intersecting the locus with a
generalized cylinder, the Cartesian product of an m-flat and an
(n-m)-sphere of radius δ, and projecting the intersection set
onto the m-flat. It is the final projection operation that
maintains  the  visibility  of  the  points. 
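
A "thick" section of a point cloud can be sketched as follows. This is an illustrative implementation under stated assumptions, not code from the paper: the m-flat is given by a point on it (`origin`) and an orthonormal list of spanning vectors (`basis`), both hypothetical parameter names, and orthonormality is assumed rather than checked.

```python
import math

def dot(a, b):
    """Euclidean inner product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))

def thick_section(points, origin, basis, delta):
    """Return flat coordinates of the points lying within delta of an m-flat.

    The m-flat passes through `origin` and is spanned by `basis`, a list of
    m orthonormal n-vectors (assumed, not verified).
    """
    kept = []
    for p in points:
        rel = [pi - oi for pi, oi in zip(p, origin)]
        coords = [dot(rel, b) for b in basis]           # projection onto the flat
        resid_sq = dot(rel, rel) - dot(coords, coords)  # squared distance to flat
        if math.sqrt(max(resid_sq, 0.0)) <= delta:
            kept.append(tuple(coords))
    return kept

# Usage: a 2-flat (the xy-plane) in 3-space with thickness delta = 0.1.
points = [(0.0, 0.0, 0.05), (1.0, 2.0, 0.2), (3.0, -1.0, -0.08)]
section = thick_section(points, origin=(0.0, 0.0, 0.0),
                        basis=[(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)], delta=0.1)
```

The returned coordinates are already expressed in the flat, which corresponds to the final projection step that keeps the captured points visible.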

The success of any of these methods depends on scale:
the scale of quantization must be sufficiently small w.r.t.
the scale of meaningful structure. This will help prevent
aliasing problems in all the methods. It keeps the notion of
neighbors simple for smoothing. With thick sections, it is
what may make it possible to find a thickness, δ, that is
sufficiently larger than the scale of quantization that slices will
not usually miss points, yet smaller than the scale of structure,
so that the thickness will not blur the global structure. Of
course  if  the  scales  of  quantization  and  structure  are  too 
close  together,  then  there  are  intrinsic  limits  on  the  adequacy 
of  the  rendition  in  the  full  space.  How  much  more  latitude  is 
needed  for  section  and  projection  is  not  yet  clear. 

4.4  Noise 

A  final  deviation  from  the  idealization  is  noise. 
Empirical  statistical  loci  typically  have  the  structure  of 
interest  obscured  by  noise,  i.e.,  random  perturbations  of  the 
positions  of  the  points.  Again  scale  seems  the  key:  If  the 
scale  of  the  noise  is  small  with  respect  to  that  of  meaningful 
structure,  then  there  should  be  no  serious  problems.  If  the 
scale of noise gets too large, then the structure's image under
section and projection may be obscured. But in such cases,
one might argue that the "true" shape of the locus has
become problematic in a more theoretically fundamental
way.

5.  Discussion 

This  paper  has  examined  some  formal  capabilities  of  the 
two  geometric  transformations  section  and  projection.  It  was 
shown that they have complementary strengths and
weaknesses in revealing structure of various dimensionality,
and  that  together  they  form  a  powerful  composition. 

Although  the  systematic  joint  use  of  section  and 
projection  should  help  the  investigation  of  high  dimensional 
loci,  a  number  of  difficulties  remain.  The  challenges 
presented by singularities of dimension p < m-1 and
quantization  effects  have  already  been  mentioned.  By  far  the 
most  important  outstanding  problem  regards  the 
comprehensive  assessment  of  shape.  Section  and  projection 
are  certainly  among  the  fundamental  graphical  tools  for 
getting  relevant  information,  but  the  geometry  of  higher 
dimensions  is  fantastically  rich,  and  even  the  most 
informative  individual  2-D  images  can  only  capture 
glimpses  of  aspects  of  the  shape. 

Thus  there  are  at  least  three  important  major  directions 
of  future  research.  The  first  involves  getting  the  most  from 
each  low  dimensional  image,  which  requires  understanding 
what  these  transformations  do  to  a  variety  of  features  of  a 
locus.  The  feature  of  dimensionality  was  the  focus  of  this 
paper.  Examples  of  other  important  aspects  (with  some 
conjectured results given in parentheses) are: what do the
transformations  do  to  simple  aspects  like  distances 
(projection  shortens  but  never  lengthens  them;  section 
preserves  them),  angle,  position,  and  orientation;  convexity 
(preserved  in  both),  polytopality  (preserved  by  both  -  but 
what  about  the  number  of  faces,  etc.),  connectedness 
(preserved  in  projection,  but  not  section).  A  systematic 
understanding of these will enrich the ability to understand
how  a  given  picture  relates  to  the  object  pictured. 
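
The conjecture that projection shortens but never lengthens distances is easy to spot-check numerically for orthogonal projection. The sketch below is only such a check on random data, not a proof; the dimension, sample size and helper names are arbitrary assumptions.

```python
import math
import random

def dot(a, b):
    """Euclidean inner product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))

def random_orthonormal_pair(n, rng):
    """Two orthonormal n-vectors, by Gram-Schmidt on Gaussian draws."""
    a = [rng.gauss(0.0, 1.0) for _ in range(n)]
    na = math.sqrt(dot(a, a))
    a = [x / na for x in a]
    b = [rng.gauss(0.0, 1.0) for _ in range(n)]
    ab = dot(a, b)
    b = [y - ab * x for x, y in zip(a, b)]  # remove the component along a
    nb = math.sqrt(dot(b, b))
    b = [y / nb for y in b]
    return a, b

def project(p, frame):
    """Coordinates of p in the plane spanned by the orthonormal frame."""
    return tuple(dot(p, f) for f in frame)

rng = random.Random(1)
n = 6
points = [[rng.uniform(-1, 1) for _ in range(n)] for _ in range(50)]
frame = random_orthonormal_pair(n, rng)
images = [project(p, frame) for p in points]

# Largest ratio of projected distance to original distance over all pairs.
worst_ratio = max(
    math.dist(images[i], images[j]) / math.dist(points[i], points[j])
    for i in range(len(points)) for j in range(i)
)
```

For an orthogonal projection `worst_ratio` should never exceed 1, up to rounding error.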

A  second  direction  for  future  work  is  how  to  make  use  of 
other  techniques,  such  as  projections  that  preserve  density 
information,  probing  (a  technique  closely  related  to 
projection),  the  use  of  regular  sampling  grids,  etc. 

The  third  major  direction  involves  the  efficient  collection 
and  assembly  of  multiple  glimpses  to  capture  the  whole 
structure.  There  has  been  considerable  work  on  algorithms 
for  the  assessment  of  shape  from  projections,  motivated  by 
the field of tomography. There has also been some
general work on inferring shapes of polytopes using
probing. Further work, encompassing both section and
section-then-projection, will be needed.

An additional, independent issue concerns the
psychological  aspects  of  high-dimensional  visualization. 
The  formal  treatments  can  explore  the  question  about  what 
kind  of  information  is  theoretically  available  from  various 
tools,  information  that  could  be  used  by  some  arbitrary 
intelligent  machine.  It  is  a  further  question  what  kind  of 
information  can  be  captured  and  integrated  by  human 
intelligence,  for  example  to  support  useful  valid  inference 
about  the  locus  as  a  result  of  the  low  dimensional  views. 
A DEMONSTRATION OF THE DATA VIEWER


Catherine  Hurley 
University  of  Waterloo 

ABSTRACT 

We have designed and implemented a program called data viewer for exploring
multivariate data sets. The program produces plots moving in real-time
by  projecting  onto  a  sequence  of  user-controlled  planes.  Multiple  plots  may 
be  simultaneously  controlled,  allowing  dynamic  comparisons  of  data  sets. 
In  this  presentation,  we  demonstrate  the  data  viewer  by  describing  and 
interpreting  a  selection  of  plots. 


1.  Introduction 

Recent  computing  advances  have  encouraged  the 
development  of  new  data  analytic  methods,  many  of 
them  graphical  in  nature  (Cleveland  1987).  We  have 
been  concerned  with  graphical  methods  for  analyzing 
multivariate  data.  Typically,  multivariate  data  is 
projected  onto  some  low  (one  or  two)  dimensional 
subspace  prior  to  display.  Motion  graphics  present 
us  with  one  way  of  improving  on  the  resulting  display 
—  simply  show  a  new  projection  every  fraction  of  a 
second. The PRIM system of Fisherkeller, Friedman
and Tukey (1974) was an early demonstration of this
technique: they used motion to display a rotating 3-d
point cloud. Programs for 3-d rotations have become
widely available in the last few years. In our data
viewer program, we go beyond 3-d rotations and use
motion to display data sets with arbitrary numbers of
variables. Briefly, the program produces moving
plots by projecting the observations onto smoothly
changing  sequences  of  planes.  This  presentation 
demonstrates  the  data  viewer  by  describing  and 
interpreting  a  selection  of  plots. 

As background, we mention some important aspects
of  the  data  viewer  design.  These  will  be  illustrated 
throughout  the  sections  which  follow. 

•  Constructing  moving  projections 

We consider 2-d projections displayed by a scatterplot,
and 1-d projections displayed with a
marginal density estimate. Changing the projection
results in a moving scatterplot or density
plot appearing on the screen. With the data
viewer, the user controls the sequence of
projections, implying also that he/she may
choose particular projections for display. The
projection sequence is constructed by interpolating
between consecutive elements of a user-chosen
sequence of target planes. For more
details, see Hurley (1987), Hurley and Buja
(1988).

•  The  user-interface 

Real-time, rather than animated, motion is
preferable for analyzing data. All data viewer
plots are produced in real-time, which calls for
real-time user-controls. For this reason, we
equip the program with a graphical user-interface,
where the user communicates with the
program by pointing a mouse at some part of the
data viewer display, and depressing a mouse button.
Further details are given in Buja et al
(1987), Hurley (1987).

2.  The  data  viewer  window 

The data viewer program produces plots in some area
on the screen which we refer to as a data viewer window.
Figure 1 shows one such window, displaying a
view of the St. Helens data set. This data set contains
680 observations on earthquakes occurring in
the vicinity of Mount St. Helens, during May, 1980,
where the quantities recorded are date, latitude,
longitude, depth and magnitude.

There is a fixed set of items appearing in a data
viewer window. These are a plot, a title, the variable
boxes on the left hand side, a control panel in the
lower left corner, and a plot interaction menu lying
next to the control panel. Each of these items
displays some information relevant to the user. In
addition, the items are mouse sensitive, and respond
to mouse clicks by changing their appearance. Most
user-program interaction occurs in this way. For
example, by clicking on various parts of the control
panel, the user controls some aspects of the scatterplot
motion, such as the speed and direction (forwards
or backwards).




Figure  1:  A  data  viewer  window 

In figure 1, the boxes for date and latitude have
horizontal and vertical lines drawn from their
centers, telling us that the displayed plot is a bivariate
scatterplot of date and latitude. A variable
box has a label X, Y, A or blank appearing on the
top left hand corner. These labels have a special
purpose: they determine which projections may be
shown. An A label signifies that the variable is
active and may appear in the current projection.
With an X (Y) label, the variable is allowed to have
a projection coefficient for the horizontal (vertical)
direction only. No label indicates that the variable is
inactive, and so has zero horizontal and vertical
coefficients. Mouse clicks in the variable boxes are used
to change the labels.

The plot interaction menu controls the style of plot
interaction, where the current possibilities are point
identification, shifting and scaling of the plot axes
(see Buja et al, 1987), rotation of the plot in the
plane, and moving projections. For instance, when
point identification is the current selection, clicking
near one of the point symbols causes a label to
appear. In the examples presented here, we are concerned
with moving projections, so the plot interaction
menu shows "PROJECTION". This implies that
clicking in the plot region causes a moving projection
to appear.


Figure  2:  A  density  estimate 


The data viewer program also displays 1-d projections,
by plotting a marginal density estimate for the
projected observations. For example, figure 2 shows
a density estimate of latitude. (The density estimate
is an average shifted histogram (Scott 1985).)
As in figure 1, the box for this variable has a horizontal
line and an X label. Since the plot shows a 1-d
projection, there is no box with a vertical line.
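
For readers unfamiliar with the average shifted histogram, the following is a minimal sketch of the standard construction: counts on a fine grid of width h/m are combined with triangular weights, which is equivalent to averaging m histograms of bin width h whose origins are shifted by h/m. The bin width, number of shifts and grid limits below are illustrative assumptions; the settings used by the data viewer are not given in the text.

```python
def ash_density(data, h, m, lo, hi):
    """Averaged shifted histogram on [lo, hi) from m shifts of bin width h.

    Returns (centers, density) on a fine grid of width h / m. Points
    outside [lo, hi) are ignored, so the grid should cover the data with
    at least m - 1 spare fine bins on each side.
    """
    delta = h / m
    nbins = int(round((hi - lo) / delta))
    counts = [0] * nbins
    for x in data:
        k = int((x - lo) // delta)
        if 0 <= k < nbins:
            counts[k] += 1
    n = len(data)
    density = []
    for k in range(nbins):
        total = 0.0
        for i in range(1 - m, m):
            if 0 <= k + i < nbins:
                # Triangular weight: nearby fine bins count more.
                total += (1.0 - abs(i) / m) * counts[k + i]
        density.append(total / (n * h))
    centers = [lo + (k + 0.5) * delta for k in range(nbins)]
    return centers, density

# Usage with made-up values standing in for a projected variable.
data = [0.1, 0.2, 0.25, 0.4, 0.9]
centers, density = ash_density(data, h=0.5, m=5, lo=-1.0, hi=2.0)
```

Because the weights sum to m, the estimate integrates to one over the fine grid whenever the grid covers the data with room to spare.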

3.  3-D  Rotations 

In figure 3(a), we have picked out the 3-variable subspace
consisting of latitude, longitude and
depth, by marking their respective boxes with A
labels. The pair of variables latitude and depth
are in the plane of the screen, while the third, longitude,
is perpendicular to the screen. Notice that
the mouse cursor is positioned on the right hand side
of the scatterplot. With a left mouse click at this
position, the point cloud rotates towards the mouse
cursor. More precisely, the point cloud spins in the
direction given by the center of the plot region and
the cursor position. A mouse click in the plot region
as the points are moving stops the motion. The next
click restarts the rotation, in the direction specified
by the current position of the mouse cursor. With
these controls, the user can spin a 3-d point cloud in
any direction. Figure 3(b) shows a picture of the
data viewer window after some point cloud rotations.




Notice now that lines are drawn in all three latitude,
longitude and depth boxes. The lines are
in fact the projections of the three coordinate axes.


Figure  3:  3-D  rotations 


4.  Linking  for  close-up  views 

All of the plots shown so far demonstrate that earthquake
locations are highly concentrated, so that it is
hard to see the structure of the dense cluster. For
this, separate plots of the high-density region are
necessary. Suppose the data set St. Helens-dense
contains the subset of cases in the high-density
region. To view this subset separately, we may construct
a second data viewer window. Figure 4 shows
two data viewer windows, one each for the
St. Helens and St. Helens-dense data sets. In
both windows, the cases belonging to the dense subset
are drawn with square glyphs, while the remaining
points have hollow circular glyphs. By comparing
the lines drawn in the variable boxes, we see that the
two windows show the same projection. This implies
that the scatterplot in the lower window is a "close-up"
of the upper scatterplot.

As before, pointing the mouse cursor at the plot
region in the upper window and clicking causes the
point cloud to rotate. However, this time the point
cloud in the lower window also rotates, and in the
same direction. This is because the second window
was constructed in a special way, in order to link it
to the existing window. In this case, simultaneous
motion of the two scatterplots permits a dynamic
data set comparison, because the second window
displays throughout a close-up of the first.


Figure 4: A close-up view


5.  Connecting  plots 

Figure  5  shows  a  data  viewer  window  for  the  Places 
data  set.  This  data  consists  of  scores  for  329  US 
cities  on  9  criteria,  chosen  to  measure  "livability"  of 
the  cities  (Rand  McNally  1986).  The  nine  criteria  are 
climate, housing, health care, crime, transportation,
education, the arts, recreation and economics. For
housing and crime, the lower the score the better.
For all other variables, the higher the score, the
better. Three additional variables are included,
namely, population (transformed to a log scale), latitude
and longitude for each of the 329 cities.

The upper plot, figure 5(a), gives a bivariate scatterplot
of latitude and longitude. The two "extra"
points on the left hand side of the map represent
Anchorage, Alaska, and Honolulu, Hawaii. Their latitude
and longitude coordinates have been adjusted so
that all cities fit nicely into the plot region. The
middle plot shows a bivariate scatterplot of climate
and housing. Instead of changing the display
immediately from one bivariate scatterplot to
another, we can gain a lot of information by watching
a smooth progression from one scatterplot to another,
and back again. We call this connecting the scatterplots.
In this way, we discover which U.S. cities have
good or bad climate, and expensive or cheap housing
prices.


Figure 5: Connecting scatterplots


We construct the sequence of projections which connect
the scatterplots shown in figure 5(a) and (b) as
follows: Suppose the window currently displays climate
and housing, and we pick longitude and
latitude as the target plot. Motion resumes with a
click on a mouse button, proceeding from the current
to the target plot. Briefly, the horizontal projection
vector rotates in the climate, longitude plane,
while the vertical projection vector rotates simultaneously
at the same rate in the housing, latitude
plane. When the projection reaches the target,
motion pauses momentarily, and then resumes back
towards the climate, housing plot. The
displayed projection continues to cycle between these
two scatterplots until the user intervenes.
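
The rotation of the two projection vectors described above can be sketched as follows. This is an illustration of the special case in the text, where each vector turns within the plane of two coordinate axes; it is not the data viewer's code, and the function names, the parameter t and the number of frames are assumptions.

```python
import math

VARIABLES = ["climate", "housing", "latitude", "longitude"]

def connecting_frame(t, x_from, x_to, y_from, y_to, variables=VARIABLES):
    """Projection coefficients a fraction t of the way between two plots.

    The horizontal vector rotates from x_from towards x_to and the
    vertical vector from y_from towards y_to, both through the same
    angle, so the pair stays orthonormal (the four names must differ).
    """
    theta = t * math.pi / 2.0
    c, s = math.cos(theta), math.sin(theta)
    x_vec = {v: 0.0 for v in variables}
    y_vec = {v: 0.0 for v in variables}
    x_vec[x_from], x_vec[x_to] = c, s
    y_vec[y_from], y_vec[y_to] = c, s
    return x_vec, y_vec

def project(row, x_vec, y_vec):
    """Screen position of one observation (a dict of variable -> value)."""
    return (sum(x_vec[v] * row[v] for v in x_vec),
            sum(y_vec[v] * row[v] for v in y_vec))

# Eleven frames from the (climate, housing) plot to (longitude, latitude).
frames = [connecting_frame(k / 10, "climate", "longitude", "housing", "latitude")
          for k in range(11)]
```

Playing the frames forwards and then backwards reproduces the cycling between the two scatterplots; an intermediate frame has non-zero horizontal coefficients for climate and longitude only, and non-zero vertical coefficients for housing and latitude only.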

The third plot in figure 5 shows one of the intermediate
projections. From the variable boxes we see that
both climate and longitude have non-zero projections
in the horizontal direction; similarly, housing
and latitude in the vertical direction. By
watching the smooth progression repeatedly between
the pair of scatterplots shown in figure 5(a) and (b),
we gain the following information:

•  The cluster of points with the best climate are
all Californian cities. They also have high housing
prices.

•  Highest housing costs are in the vicinity of New
York. (The two points with very high scores on
housing are actually Connecticut cities).

•  The  mid-west  has  the  worst  climate:  Minnesota, 
Wisconsin,  and  the  Dakotas. 

6.  Linking  to  compare  transformed  data 

Some of the ratings, in particular the-arts and
health-care, give extremely high scores to the biggest
cities: New York, Chicago and L.A. This
results in scatterplots where most of the observations
are clustered together, so that associations between
variables are hard to pick out. For this reason, ratings
are transformed to normal scores.
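
A normal scores transformation replaces each value by the standard normal quantile corresponding to its rank. The sketch below uses the plotting position (rank - 0.5)/n, which is one common convention and an assumption here; the text does not say which formula the authors used, and the example ratings are made up.

```python
from statistics import NormalDist

def normal_scores(values):
    """Replace each value by the standard normal quantile of its rank."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    scores = [0.0] * n
    for rank, i in enumerate(order, start=1):
        # (rank - 0.5) / n is one common plotting position; ties are
        # broken by original order rather than averaged.
        scores[i] = NormalDist().inv_cdf((rank - 0.5) / n)
    return scores

ratings = [5200, 310, 450, 56000, 700]  # made-up, heavily skewed scores
scores = normal_scores(ratings)
```

The ordering of the cities is unchanged, but an extreme rating such as the 56000 above is pulled in to the largest normal quantile instead of dominating the scale.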

Figure 6 shows two data viewer windows, the upper
one with the rating variables as before, and the lower
one with the normal scores. Both windows display a
density estimate for a linear combination of the rating
variables. The linear combination is the same
since the data viewer windows are linked by common
projections. Notice the dot on the extreme right in
the upper plot; this is New York. In the lower plot,
New York lies far closer to the other cities. As the
projection vector moves in the space spanned by the
X-variables, we see how the transformation to normal




scores affects marginal distributions. The density
estimate in the lower window is generally symmetric,
and quite often looks "bell-shaped". For the
untransformed ratings, the 1-d projections have
highly skewed distributions. With a moving x-vector,
the density's peak shifts to and fro across the screen.


Figure  6:  Comparing  transformed  data 


7.  Predictor-response  plots 

Most of the nine rating variables tend to assign high
values to big cities. To judge the overall nature of
the association between population and the ratings,
we examine plots of population against linear combinations
of the rating variables. Suppose we pick
population as the single Y-variable, and make each
of the-arts, health-care, economics, education
and recreation X-variables. (From the
bivariate scatterplots, these five have the strongest
individual associations with population.) Then,
motion yields a plot of population against a changing
linear combination of the five X-variables.

By watching the moving scatterplot, we discover a
projection with high x-y association, as shown in figure
7(a). We can see that population is linearly
related to a weighted average of the five selected rating
variables. Also, health-care and the-arts
have the largest coefficients, whereas the coefficients
for economics and recreation are comparatively
small. (The variables have been transformed to normal
scores, so that it is reasonable to compare their
projection coefficients.)

Do the variables economics and recreation have
a negligible contribution to the x-y association in the
above projection? We may answer this question as
follows. Suppose we deactivate the two variables
economics and recreation, thus requiring them
to have zero projection coefficients in succeeding target
planes. In particular, the x-vector for the next
target will be the current x-vector orthogonalized
with regard to the two deactivated variables. With a
rotation towards this target, we receive a visual
impression of how the quality of the x-y association
deteriorates (if at all), as the coefficients of the two
variables shrink to zero. The second plot in figure 7
shows the projection onto the new target. Overall, it
looks very similar to the previous plot, with most
changes occurring among cities with lower population.
As far as the eye can judge, economics and
recreation do not contribute to the x-y association
observed in the upper plot.
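
Orthogonalizing the x-vector with regard to deactivated variables amounts to removing its components along those coordinate axes, that is, setting their coefficients to zero. The sketch below also rescales the result to unit length, which is an assumption about how the target vector is normalized; the coefficient values are hypothetical and are not read from figure 7.

```python
import math

def deactivate(x_vector, inactive):
    """Zero the coefficients of the inactive variables and rescale to unit length."""
    kept = {v: (0.0 if v in inactive else c) for v, c in x_vector.items()}
    norm = math.sqrt(sum(c * c for c in kept.values()))
    if norm == 0.0:
        raise ValueError("no active variable has a non-zero coefficient")
    return {v: c / norm for v, c in kept.items()}

# Hypothetical current x-vector over the five X-variables.
current = {"the-arts": 0.6, "health-care": 0.7, "education": 0.3,
           "economics": 0.2, "recreation": 0.1}
target = deactivate(current, {"economics", "recreation"})
```

The remaining coefficients keep their relative sizes, so rotating from `current` towards `target` shows only the effect of shrinking the two deactivated coefficients.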


Figure  7:  Exploratory  regression 
8. Data derived variables

The data viewer can also display plots of principal
components, canonical variates or the linear discriminants.
Indeed, the user may choose any linear combinations
to form additional variables, but as a rule,
data derived combinations will be the most useful.


Figure 8 shows a data viewer window for the Places
data, with additional boxes on the right hand side for
some new variables. In this case, the new variables
were obtained by performing a discriminant analysis
using the nine ratings, where the cities were classed
by location into (i) west coast states plus Alaska and
Hawaii, (ii) Rocky mountain states, (iii) mid-west
states, (iv) south-west states, (v) south-east states
and (vi) north-east states. The purpose is to discover
how the ratings vary across locations.

The user first marks each of the location groups with
a different plotting symbol, and then asks the data
viewer to compute the discriminants on the basis of
the ratings. When the calculations are complete, the
data viewer redraws its window, with more boxes on
the r.h.s. for the derived variables. The additional
boxes are ordered column-wise, and labeled discr-1
through discr-5 for the discriminants, rest-1 ...
rest-4 for some dummy variables, and followed by
population, latitude and longitude.

Now the user can specify (moving) projections in
terms of either the l.h.s. or r.h.s. variables. Figure 8
displays a projection obtained by performing 3-d
rotations in the space spanned by the first three
discriminants. This projection gives good separation
of the west coast and north-east states in the horizontal
direction. For a clearer presentation of the
two groups, they are marked with large squares and
open circles respectively, while cities in other regions
are not shown. The l.h.s. boxes show which rating
variables contribute to the separation. Note that

•  climate has a large positive coefficient in the
horizontal direction; since west coast cities lie to
the right of east coast cities in the scatterplot,
this implies that the west coast has better climate.

•  Health-care and education have moderately
sized, but negative, horizontal coefficients.
Therefore, it seems as if east coast cities offer
superior health-care and education facilities.

•  Recreation, transportation and economics
have little or no impact on the separation
observed.

9.  Conclusion 

This presentation aimed to illustrate some of the
capabilities of the data viewer program, through
describing some of the displays produced. With a
system which relies so heavily on real-time motion
and real-time graphical interaction, a textual description
of a few static plots is at best a poor substitute
for a "live" demonstration. However, we would hope
to have convinced the reader of the potential of data
analysis tools such as data viewer.
ABSTRACT

By means of parallel coordinates a non-projective mapping between subsets of $R^N$ into subsets of
$R^2$ (i.e. $2^{R^N} \to 2^{R^2}$) is obtained. In this way not only N-tuples but also relations among N
variables, for any positive integer N, can be visualized in terms of their planar images. These
planar diagrams have geometrical properties corresponding to some properties of the N-dimensional
relation they represent. Starting from a point ↔ line duality when N = 2, the representation of
lines in $R^N$ is given and illustrated by an application to Air Traffic Control (i.e. for $R^4$). It is
followed by the representation of hyperplanes, and more general hypersurfaces. There is an
algorithm for constructing and displaying any interior point to such a hypersurface showing some
local (i.e. near the point) properties of the hypersurface and information on the point's proximity to
the boundary.


Introduction 

Other than a superficial similarity to
Nomography, Parallel Coordinates were first
formulated in 1978 with the first report appearing
in 1981 (see [12]). They provide a methodology for
visualizing not only N-Dimensional points but also
N-Dimensional Hypersurfaces (i.e. relations among
N variables) for arbitrary N, the method being the
same for every N. Other methodologies (but not
suited for multivariate relations) are well known
(see [1], [2], [4] and the bibliographies in [5] and
[15], for example). Some applications of parallel
coordinates can be found in [6], [7], [12], [13], [15],
[16], [17] and [18].


Parallel Coordinates

On the plane with xy-Cartesian coordinates, and
starting on the y-axis, N copies of the real line,
labeled $x_1, x_2, \ldots, x_N$, are placed equidistant and
perpendicular to the x-axis. They are the axes of
the parallel coordinate system for Euclidean N-Dimensional
Space $R^N$, all having the same positive
orientation as the y-axis (see Figure 1). A point C
with coordinates $(c_1, c_2, \ldots, c_N)$ is represented by
the polygonal line whose N vertices are at $(i-1, c_i)$
on the $x_i$-axis for $i = 1, \ldots, N$. In effect, a 1-1
correspondence between points in $R^N$ and planar
polygonal lines with vertices on $x_1, x_2, \ldots, x_N$ is
established. A convex hypersurface in $R^N$ is represented
by the envelope of the family of polygonal
lines representing all points on the hypersurface (see
[3]). In short, a non-projective mapping $2^{R^N} \to 2^{R^2}$
is established. The key idea is that the description
of a higher dimensional object is captured, to a
considerable extent, in the 2-dimensional representation
of the envelope of the polygonal lines representing
its points.
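
As a minimal sketch of the point-to-polygonal-line correspondence just described, the following Python places the i-th coordinate on an axis at horizontal position i - 1. The function names and the `spacing` parameter are illustrative choices, not part of the paper.

```python
def polyline(point, spacing=1):
    """Vertices of the polygonal line representing an N-dimensional point.

    With spacing 1 the i-th vertex (i = 1..N) is (i - 1, c_i), as in the text.
    """
    return [(i * spacing, c) for i, c in enumerate(point)]


def point_from_polyline(vertices):
    """Invert the mapping: read the coordinates back off the vertices."""
    return tuple(height for _, height in vertices)


vertices = polyline((3, -1, 2, 5))
recovered = point_from_polyline(vertices)
```

Because the axis positions are fixed, the vertex heights alone determine the point, which is the 1-1 correspondence stated above.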

Points are denoted by capitals and lines (or arcs of
curves) by lower-case letters respectively. In parallel
coordinates, the corresponding symbols are shown
with a bar superscript (i.e. $\bar{\ell}$ represents the line $\ell$,
$\bar{P}$ represents the point P, etc.).


The Fundamental Point ↔ Line Duality


Points on the plane are represented by segments
between the $x_1$ and $x_2$ axes and, in fact, by the line
containing the segment. In Figure 2, the distance
between the $x_1$ and $x_2$ axes is "d". The line

$$\ell : x_2 = m x_1 + b, \quad m < \infty$$

is the collection of points A. They are represented
by the infinite collection of lines $\bar{A}$ on the xy-plane
which, when $m \neq 1$, intersect at the point:

$$\bar{\ell} : \left( \frac{d}{1-m}, \; \frac{b}{1-m} \right)$$

given with respect to the xy-Cartesian coordinates.
The reason for representing the point P by the whole
line $\bar{P}$, rather than just the segment between the
parallel axes, is that $\bar{\ell}$ may lie outside the strip
between the axes. For lines with m = 1, we consider
xy and $x_1 x_2$ as two copies of the Projective Plane [8]
so that the line $\ell$ corresponds to the ideal point $\bar{\ell}$
with tangent direction (i.e. slope) b/d. Conversely,
in the $x_1 x_2$-projective plane the ideal point with slope
m is mapped into the vertical line at x = d/(1 - m) of
the xy-projective plane. Hence, we have a duality
between points and lines of the Projective Plane.
This duality, as expressed by means of homogeneous
coordinates, is a linear transformation (a correlation)
between the line coordinates [m, -1, b] of $\ell$
and the point coordinates (d, b, 1 - m) of $\bar{\ell}$:

$$(\bar{\ell}) = A[\ell]$$

where $[\ell]$ and $(\bar{\ell})$, the line and point (homogeneous)
coordinates respectively, are taken as column vectors
and A is a non-singular 3x3 matrix.
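
The text does not write out the matrix A. The sketch below assumes one matrix that sends [m, -1, b] to (d, b, 1 - m); any nonzero scalar multiple would serve equally well in homogeneous coordinates. It then checks that the line representing a point of the original line passes through the resulting dual point. The function names are illustrative.

```python
from fractions import Fraction


def correlation_matrix(d):
    """An assumed 3x3 matrix A with A [m, -1, b]^T = (d, b, 1 - m)^T; det A = d."""
    return [[0, -d, 0],
            [0, 0, 1],
            [-1, -1, 0]]


def apply(matrix, vector):
    """Plain matrix-vector product."""
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]


def dual_point(m, b, d=1):
    """Homogeneous coordinates of the dual point and, if finite, its xy position."""
    h = apply(correlation_matrix(d), [m, -1, b])
    if h[2] == 0:
        return h, None  # m = 1: ideal point in the direction with slope b/d
    return h, (Fraction(h[0]) / Fraction(h[2]), Fraction(h[1]) / Fraction(h[2]))


def height_at(x1, x2, d, x):
    """Height at horizontal position x of the line joining (0, x1) and (d, x2)."""
    return x1 + (x2 - x1) * Fraction(x) / d


# The line x2 = -x1 + 2 with axes one unit apart.
h, p = dual_point(-1, 2, d=1)
# A point on that line, x1 = 5 and x2 = -3, is drawn as a line through p.
check = height_at(5, -3, 1, p[0])
```

For m = 1 the third homogeneous coordinate vanishes, which is the ideal-point case handled in the text by passing to the projective plane.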

By means of the correlation above the collection
of points on a curve is mapped into a collection of
lines which can be considered as tangents to another
curve. On the plane conics map into conics (see
[9]). Actually, this property is more general and
applies to generalized conics. Consider a double cone
whose base is any bounded convex set as shown in
Figure 5. As in the ordinary conics, three kinds of
planar sections exist, those having bounded, unbounded
or two disjoint unbounded components. By analogy to
the ordinary conics they are called estars, pstars and
hstars (the "e" for ellipse, "p" for parabola and "h"
for hyperbola) respectively. Collectively, they are
referred to as gconics. It turns out that gconics map
into gconics (see [13]) and in particular estars map
into hstars, as shown in Figure 3. This yields a new
duality between bounded and unbounded convex
sets and hstars as well as a duality between Convex
Merge (Convex Union) and Intersection. Based on
these results efficient new algorithms for Convex
Hull construction, and the Convex Merge and Intersection
of Convex sets were derived (see [17]).
For non-convex curves there is a surprising duality
between cusps and inflection points, as shown


Lines in $R^N$

Consider now a line $\ell$ in $R^N$ described by:

$$\ell_{i,i+1} : x_{i+1} = m_i x_i + b_i, \quad i = 1, \ldots, N-1, \quad m_i \neq 0.$$

In the $x_i x_{i+1}$-plane the relation labeled $\ell_{i,i+1}$ is a
line and by the correlation $C_A$, translated appropriately,
it is represented by the point

$$\bar{\ell}_{i,i+1} : \left( (i-1) + \frac{1}{1-m_i}, \; \frac{b_i}{1-m_i} \right).$$

There are N - 1 such independent relations in the
given set of equations, ergo the line $\ell$ is represented
by the corresponding N - 1 points. For example, in
Figure 4 we see several points on a line interval in
$R^{10}$. It is clear from the diagram how a point can
be constructed on the line, for any given initial
value of one of the variables. It is also clear how,
given the equations or the coordinates of their equivalent
points in parallel coordinates, points on the
line can be calculated. It also turns out that the
minimum distance between two lines is "visible" in
parallel coordinates [16], a useful property in problems
involving proximity as in Air Traffic Control (see
Figure 6). The time axis can be thought of as a
"clock" and at any given time T, the position of
the aircraft is found by selecting the value of T on
the T-axis.
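
The construction of a point on the line from one starting value can be sketched as follows. The sketch assumes unit spacing between adjacent axes and slopes different from 1, so that every representing point is finite; the function names are illustrative, not taken from the paper.

```python
from fractions import Fraction


def indexed_points(slopes, intercepts):
    """The N - 1 points representing x_{i+1} = m_i x_i + b_i, i = 1..N-1."""
    points = []
    for i, (m, b) in enumerate(zip(slopes, intercepts), start=1):
        if m == 1:
            raise ValueError("m_i = 1 gives an ideal point, not handled here")
        m, b = Fraction(m), Fraction(b)
        points.append(((i - 1) + 1 / (1 - m), b / (1 - m)))
    return points


def point_on_line(slopes, intercepts, x1):
    """Coordinates of the point on the line with the given first coordinate."""
    coords = [Fraction(x1)]
    for m, b in zip(slopes, intercepts):
        coords.append(Fraction(m) * coords[-1] + Fraction(b))
    return coords


def segment_height(coords, i, x):
    """Height at x of the polygonal segment between the x_i and x_{i+1} axes."""
    left, right = coords[i - 1], coords[i]
    return left + (right - left) * (Fraction(x) - (i - 1))


# A line in 3 dimensions: x2 = 2 x1 + 1 and x3 = -x2.
slopes, intercepts = [2, -1], [1, 0]
points = indexed_points(slopes, intercepts)
coords = point_on_line(slopes, intercepts, 1)
```

Every polygonal line built this way passes, segment by segment (extended where necessary), through the same N - 1 points, which is why the line can be read off from those points alone.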


Hyperplanes in $R^N$

Up to this point a very special and useful fact
concerning straight lines has not been mentioned.
In two dimensions a line in Euclidean space
transforms into a point in parallel coordinates. Every
line parallel to such a line also transforms into
a point in parallel coordinates. The x-coordinate of
the point representing every such line is the same,
namely 1/(1-m). That is
to say, the set of parallel lines in Euclidean coordinates
transforms into a vertical line in parallel
coordinates. In N dimensions a set of parallel lines
transforms into N - 1 vertical lines. This is the
basis for the representation of any hyperplane by
N - 1 vertical lines and a polygonal line representing
one of its points. In Figure 7 a planar relation
among industrial data was discovered from this observation.

Figure 6: The trajectory of an aircraft
flying on a straight line path with constant
velocity is a line in 4-D and can be represented
by 3 stationary points.
On the four parallel axes a polygonal line
shows the time, value on the T-axis, when
the position $(x_1, x_2, x_3)$ is attained. Although
in an accurate 3-D isometric (above left)
the aircraft look as if they are almost
colliding, the information in parallel coordinates
shows that this is not the case.


Hypersurfaces in $R^N$

A  feel  for  the  power  of  the  representation  can 
be  gained  from  Figure  9  from  which,  with  a 
bit  of  practice,  the  vertices,  edges  and  faces,  and 
their  interrelationship,  of  the  hypercube  can  be  rec- 
On Some Graphical Representations of Multivariate Data

Masood Bolorforoush and Edward J. Wegman
George Mason University


1.  Introduction.  The  classic  scatter  diagram  is  a 
fundamental  tool  in  the  construction  of  a  model  for  data.  It 
allows  the  eye  to  detect  such  structures  in  data  as  linear  or 
nonlinear  features,  clustering,  outliers  and  the  like. 
Unfortunately, scatter diagrams do not generalize readily
beyond  three  dimensions.  For  this  reason,  the  problem  of 
visually  representing  multivariate  data  is  a  difficult,  largely 
unsolved  one.  The  principal  difficulty,  of  course,  is  the  fact 
that  while  a  data  vector  may  be  arbitrarily  high  dimensional, 
say  n,  Cartesian  scatter  plots  may  only  easily  be  done  in  two 
dimensions  and,  with  computer  graphics  and  more  effort,  in 
three  dimensions.  Alternative  multidimensional  representations 
have  been  proposed  by  several  authors  including  Chernoff 
(1973),  Fienberg  (1979),  Cleveland  and  McGill  (1984a)  and 
Carr  et  al.  (1986). 

An  important  technique  based  on  the  use  of  motion  is 
the  computer-based  kinematic  display  yielding  the  illusion  of 
three  dimensional  scatter  diagrams.  This  technique  was 
pioneered  by  Friedman  and  Tukey  (1973)  and  is  now  available 
in commercial software packages (Donoho's MacSpin and
Velleman's  Data  Desk).  Coupled  with  easy  data  manipulation, 
the  kinematic  display  techniques  have  spawned  the  exploitation 
of  such  methods  as  projection  pursuit  (Friedman  and  Tukey, 
1974)  and  the  grand  tour  (Asimov,  1985).  Clearly,  projection- 
based  techniques  lead  to  important  insights  concerning  data. 
Nonetheless,  one  must  be  cautious  in  making  inferences  about 
high  dimensional  data  structures  based  on  projection  methods 
alone. It would be highly desirable to have a simultaneous
representation  of  all  coordinates  of  a  data  vector  especially  if 
the  representation  treated  all  components  in  a  similar  manner. 
The  cause  of  the  failure  of  the  standard  Cartesian  coordinate 
representation  is  the  requirement  for  orthogonal  coordinate 
axes. In a 3-dimensional world, it is difficult to represent more
than  three  orthogonal  coordinate  axes.  We  propose  to  give  up 
the  orthogonality  requirement  and  replace  the  standard 
Cartesian  axes  with  a  set  of  n  parallel  axes. 

2.  Parallel  Coordinates.  We  propose  as  a  multivariate 
data  analysis  tool  the  following  representation.  In  place  of  a 
scheme  trying  to  preserve  orthogonality  of  the  n-dimensional 
coordinate axes, draw them as parallel. A vector $(x_1, x_2, \ldots, x_n)$
is plotted by plotting $x_1$ on axis 1, $x_2$ on axis 2 and so on
through $x_n$ on axis n. The points plotted in this manner are
joined by a broken line. Figure 2.1 illustrates two points (one
solid, one dashed) plotted in parallel coordinate representation.
In this illustration, the two points agree in the fourth
coordinate. The principal advantage of this plotting device is
clear. Each vector $(x_1, x_2, \ldots, x_n)$ is represented in a planar
diagram  so  that  each  vector  component  has  essentially  the  same 
representation. 

The parallel coordinates proposal has its roots in a
number of sources. Griffen (1958) considers a 2-dimensional
parallel coordinate type device as a method for graphically
computing the Kendall tau correlation coefficient. Hartigan
(1975) describes the "profiles algorithm" which he describes as
"histograms on each variable connected between variables by
identifying cases." Although he does not recommend drawing
all profiles, a profile diagram with all profiles plotted is a
parallel coordinate plot. There is however far more
mathematical structure, particularly high dimensional
structure, to the parallel coordinate diagram than Hartigan
exploits. Inselberg (1985) originated the parallel coordinate
representation as a device for computational geometry. His
1985 paper is the culmination of a series of technical reports
dating from 1981. Finally we note that Diaconis and Friedman
(1983) discuss the so-called M and N plots. Their special case
of a 1 and 1 plot is a parallel coordinate plot in two
dimensions. Indeed, the 1 and 1 plot is sometimes called a
before-and-after plot and has a much older history. The
fundamental theme of this paper is that the transformation
from Cartesian coordinates to parallel coordinates is a highly
structured mathematical transformation, hence, maps
mathematical objects into mathematical objects. Certain of
these can be given highly useful statistical interpretations so
that this representation becomes a highly useful data analysis
tool.

Figure 2.1 Parallel coordinate representation of two n-dimensional
points.

3. Parallel Coordinate Geometry. The parallel
coordinate representation enjoys some elegant duality properties
with the usual Cartesian orthogonal coordinate representation.
Consider a line L in the Cartesian coordinate plane given by
L: y = mx + b and consider two points lying on that line, say
(a, ma+b) and (c, mc+b). For simplicity of computation we
consider the xy Cartesian axes mapped into the xy parallel axes
as described in Figure 3.1. We superimpose a Cartesian
coordinate axes t,u on the xy parallel axes so that the y parallel
axis has the equation u = 1. The point (a, ma+b) in the xy
Cartesian system maps into the line joining (a, 0) to (ma+b, 1)
in the tu coordinate axes. Similarly, (c, mc+b) maps into the
line joining (c, 0) to (mc+b, 1). It is a straightforward
computation to show that these two lines intersect at a point
(in the tu plane) given by $\bar{L}: (b(1-m)^{-1}, (1-m)^{-1})$. Notice
that this point in the parallel coordinate plot depends only on
m and b, the parameters of the original line in the Cartesian
plot. Thus $\bar{L}$ is the dual of L and we have the interesting
duality result that points in Cartesian coordinates map into
lines in parallel coordinates while lines in Cartesian coordinates
map into points in parallel coordinates.
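
The "straightforward computation" can be checked numerically. The sketch below parametrizes each dual line by its position t at u = 0 and its change in t per unit u, then intersects two of them; the helper names are illustrative and exact rational arithmetic is used so the comparison is exact.

```python
from fractions import Fraction


def dual_line(x, y):
    """Line in the tu plane joining (x, 0) to (y, 1), stored as (t at u=0, dt/du)."""
    x, y = Fraction(x), Fraction(y)
    return x, y - x


def intersect(line1, line2):
    """Intersection (t, u) of two dual lines, or None if they are parallel."""
    t1, s1 = line1
    t2, s2 = line2
    if s1 == s2:
        return None
    u = (t2 - t1) / (s1 - s2)
    return t1 + u * s1, u


def dual_of_cartesian_line(m, b):
    """The closed form from the text: (b / (1 - m), 1 / (1 - m)), for m != 1."""
    m, b = Fraction(m), Fraction(b)
    return b / (1 - m), 1 / (1 - m)


# Two points on y = -x + 2, at x = 0 and x = 3.
m, b = -1, 2
meeting = intersect(dual_line(0, m * 0 + b), dual_line(3, m * 3 + b))
```

With m = 1 the two dual lines have equal dt/du, so `intersect` returns None, matching the breakdown of the formula discussed in the text for that case.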

For $0 < (1-m)^{-1} < 1$, m is negative and the
intersection occurs between the parallel coordinate axes. For
m = -1, the intersection is exactly midway. A ready statistical
interpretation can be given. For highly negatively correlated
pairs, the dual line segments in parallel coordinates will tend to
cross near a single point between the two parallel coordinate
axes. The scale of one of the variables may be transformed in
such a way that the intersection occurs midway between the
two parallel coordinate axes in which case the slope of the
linear relationship is negative one.

Figure 3.1 Cartesian and parallel coordinate plots of two
points. The tu Cartesian coordinate system is superimposed on
the xy parallel coordinate system.

In the case that $(1-m)^{-1} < 0$ or $(1-m)^{-1} > 1$, m is
positive and the intersection occurs external to the region
between the two parallel axes. In the special case m = 1, this
formulation breaks down. However, it is clear that the point
pairs are (a, a+b) and (c, c+b). The dual lines to these points
are the lines in parallel coordinate space with slope $b^{-1}$ and
intercepts $-ab^{-1}$ and $-cb^{-1}$ respectively. Thus the duals of
these lines in parallel coordinate space are parallel lines with
slope $b^{-1}$. We thus append the ideal points to the parallel
coordinate plane to obtain a projective plane. These parallel
lines intersect at the ideal point in direction $b^{-1}$. In the
statistical setting, we have the following interpretation. For
highly positively correlated data, we will tend to have lines not
intersecting between the parallel coordinate axes. By suitable
linear rescaling of one of the variables, the lines may be made
approximately parallel in direction with slope $b^{-1}$. In this case
the slope of the linear relationship between the rescaled
variables is one. See Figure 3.2 for an illustration of large
positive and large negative correlations. Of course, nonlinear
relationships will not respond to simple linear rescaling.
However, by suitable nonlinear transformations, it should be
possible to transform to linearity. The point-line, line-point
duality seen in the transformation from Cartesian to parallel
coordinates extends to conic sections. An instructive
computation involves computing in the parallel coordinate
space the image of an ellipse which turns out to be a general
hyperbolic form. For purposes of conserving space we do not
provide the details here.

It should be noted, however, that the solution to this
computation is not a locus of points, but a locus of lines, a line
conic. The envelope of this line conic is a point conic. In the
case of this computation, the point conic in the original
Cartesian coordinate plane is an ellipse, the image in the
parallel coordinate plane is, as we have just seen, a line
hyperbola with a point hyperbola as envelope. Indeed, it is true
that a conic will always map into a conic and, in particular, an
ellipse will always map into a hyperbola. The converse is not
true. Depending on the details, a hyperbola may map into an
ellipse, a parabola or another hyperbola. A fuller discussion of
projective transformations of conics is given by Dimsdale
(1984). Inselberg (1985) generalizes this notion into parallel
coordinates resulting in what he calls hstars.

We mentioned the duality between points and lines and
conics and conics. It is worthwhile to point out two other nice
dualities. Rotations in Cartesian coordinates become
translations in parallel coordinates and vice versa. Perhaps
more interesting from a statistical point of view is that points
of inflection in Cartesian space become cusps in parallel
coordinate space and vice versa. Thus the relatively hard-to-detect
inflection point property of a function becomes the
notably more easy to detect cusp in the parallel coordinate
representation. Inselberg (1985) discusses these properties in
detail.

4.  Further  Statistical  Interpretations.  Since  ellipses 
map  into  hyperbolas,  we  can  have  an  easy  template  for 
diagnosing  uncorrelated  data  pairs.  Consider  Figure  3.2.  With 
a  completely  uncorrelated  data  set,  we  would  expect  the  2- 
dimensional  scatter  diagram  to  fill  substantially  a 
circumscribing  circle.  As  illustrated  in  Figure  3.2,  the  parallel 
coordinate  plot  would  approximate  a  figure  with  a  hyperbolic 
envelope.  As  the  correlation  approaches  negative  one,  the 
hyperbolic  envelope  would  deepen  so  that  in  the  limit  we  would 
have  a  pencil  of  lines,  what  we  like  to  call  the  cross-over  effect. 
As  the  correlation  approaches  positive  one,  the  hyperbolic 
envelope  would  widen  with  fewer  and  fewer  crossovers  so  that 
in  the  limit  we  would  have  parallel  lines.  Thus  correlation 
structure can be diagnosed from the parallel coordinate plot.
As noted earlier, Griffen (1958) used this as a graphical device
for  computing  the  Kendall  tau. 
Figure 3.2 Parallel coordinate plot of 6 dimensional data
illustrating correlations of
$\rho$ = 1, .8, .2, 0, -.2, -.8 and -1.

Griffen, in fact, attributes the graphical device to
Holmes (1928) which predates Kendall's discussion. The
computational formula is

$$\tau = 1 - \frac{4X}{n(n-1)}$$

where X is the number of intersections resulting by connecting
the two rankings of each member by lines, one ranking having
been put in natural order. While the original formulation was
framed in terms of ranks for both x and y axes, it is clear that
the number of crossings is invariant to any monotone increasing
transformation of either x or y, the ranks being one such
transformation. Because of this scale invariance, one would
expect rank-based statistics to have an intimate relationship to
parallel coordinates.

It is clear that if there is a perfect positive linear
relationship with no crossings, then X = 0 and $\tau = 1$.
Similarly, if there is a perfect negative linear relationship,
Figure 3.2 is again appropriate and we have a pencil of lines.
Since every line meets every other line, the number of
intersections is $\binom{n}{2}$ so that

$$\tau = 1 - \frac{4\binom{n}{2}}{n(n-1)} = -1.$$
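
The crossing-count formula is easy to try out. The sketch below counts, for every pair of observations, whether their line segments between the two parallel axes cross, which happens exactly when the two observations are ordered oppositely on x and y. It assumes no ties in either variable; the function names are illustrative.

```python
from itertools import combinations


def crossings(x, y):
    """Number of pairs of observations whose parallel-coordinate segments cross."""
    return sum(
        1
        for (x1, y1), (x2, y2) in combinations(zip(x, y), 2)
        if (x1 - x2) * (y1 - y2) < 0
    )


def tau_from_crossings(x, y):
    """Kendall tau as 1 - 4X / (n (n - 1)), assuming no tied values."""
    n = len(x)
    return 1 - 4 * crossings(x, y) / (n * (n - 1))


x = [1, 2, 3, 4]
increasing = tau_from_crossings(x, [10, 20, 30, 40])   # no crossings
decreasing = tau_from_crossings(x, [40, 30, 20, 10])   # every pair crosses
one_swap = crossings(x, [20, 10, 30, 40])              # a single crossing
```

Since only the sign of each pairwise difference is used, replacing x or y by any monotone increasing transformation leaves the count unchanged, which is the invariance noted in the text.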


It  should  be  further  noted  that  clustering  is  easily  diagnosed 
using  the  parallel  coordinate  representation. 


So far we have focused primarily on pairwise parallel
coordinate relationships. The idea however is that we can, so
to speak, stack these diagrams and represent all n dimensions
simultaneously. Figure 4.1 thus illustrates 6-dimensional
Gaussian uncorrelated data plotted in parallel coordinates. A
6-dimensional ellipsoid would have a similar general shape but
with hyperbolas of different depths. This data is deep ocean
acoustic noise and is illustrative of what might be expected.




Figure  4.1  Parallel  coordinate  plot  of  6  channel  sonar  data. 
The data is uncorrelated Gaussian noise. The second
coordinate  represents  a  relatively  remote  hydrophone  and  has  a 
somewhat  different  mean.  Notice  the  approximate  hyperbolic 
shape. 


Figure  4.2  A  five  dimensional  scatter  diagram  in  parallel 
coordinates  illustrating  marginal  densities,  correlations,  three 
dimensional  clustering  and  a  five  dimensional  mode. 


Figure 4.2 is illustrative of some data structures one
might see in a five-dimensional data set. First it should be
noted that the plots along any given axis represent dot
diagrams (a refinement of the histograms of Hartigan), hence
convey graphically the one-dimensional marginal distributions.
In this illustration, the first axis is meant to have an
approximately normal distribution shape while axis two has the
shape of the negative of a [...]. As discussed above, the pairwise
comparisons can be made. Figure 4.2 illustrates a number of
instances of linear (both negative and positive), nonlinear and
clustering situations. Indeed, it is clear that there is a
3-dimensional cluster along coordinates 3, 4 and 5.

Consider also the appearance of a mode in parallel
coordinates. The mode is, intuitively speaking, the location of
the most intense concentration of probability. Hence, in a
sampling situation it will be the location of the most intense
concentration of observations. Since observations are
represented by broken line segments, the mode in parallel
coordinates will be represented by the most intense bundle of
broken line paths in the parallel coordinate diagram. Roughly
speaking, we should look for the most intense flow through the
diagram. In Figure 4.2, such a flow begins near the center of
coordinate axis one and finishes on the left-hand side of axis
five.

Figure 4.2 thus illustrates some data analysis features
of the parallel coordinate representation including the ability to
diagnose one-dimensional features (marginal densities),
two-dimensional features (correlations and nonlinear structures),
three-dimensional features (clustering) and a five-dimensional
feature (the mode). In the next section of this paper we
consider a real data set which will be illustrative of some
additional capabilities.

5. An Auto Data Example. We illustrate parallel
coordinates as an exploratory analysis tool on data about 86
1980 model year automobiles. They consist of price, miles per
gallon, gear ratio, weight and cubic inch displacement. For n
= 5, 3 presentations are needed to present all pairwise
permutations. Figures 5.1, 5.2 and 5.3 are these three
presentations. In Figure 5.1, perhaps the most striking feature
is the cross-over effect evident in the relationship between gear
ratio and weight. This suggests a negative correlation. Indeed,
this is reasonable since a heavy car would tend to have a large
engine providing considerable torque thus requiring a lower gear
ratio. Conversely, a light car would tend to have a small
engine providing small amounts of torque thus requiring a
higher gear ratio.

Figure 5.1 A parallel coordinate plot in five dimensions of
automobile data. Note the negative correlation between gear
ratios and weight.
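
The claim that three presentations suffice for n = 5 can be checked by brute force. Each axis ordering shows n - 1 adjacent pairs, so with 10 pairs to show at least three orderings are needed. The sketch below searches for k orderings whose adjacent pairs cover every pair of axes. It fixes the first ordering as 1, 2, ..., n, which loses no generality because the axes can be relabeled. The function names are illustrative and this exhaustive search is not the construction used in the paper.

```python
from itertools import combinations, permutations


def adjacent_pairs(order):
    """Unordered pairs of axes drawn next to each other in one presentation."""
    return {frozenset(pair) for pair in zip(order, order[1:])}


def find_orderings(n, k):
    """Return k axis orderings covering all pairs of n axes, or None if impossible."""
    axes = tuple(range(1, n + 1))
    wanted = {frozenset(pair) for pair in combinations(axes, 2)}
    first = adjacent_pairs(axes)
    others = [p for p in permutations(axes) if p != axes]
    for rest in combinations(others, k - 1):
        covered = set(first)
        for order in rest:
            covered |= adjacent_pairs(order)
        if covered == wanted:
            return (axes,) + rest
    return None


three = find_orderings(5, 3)   # a valid set of three presentations
two = find_orderings(5, 2)     # two presentations cannot cover all 10 pairs
```

The search is only practical for small n, since the number of orderings grows factorially.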

Consider  as  well  the  relationship  between  weight  and 
cubic  inch  displacement.  In  this  diagram  we  have  a 
considerable  amount  of  approximate  parallelism  (relatively  few 
crossings)  suggesting  positive  correlation.  This  is  a  graphic 
representation  of  the  fact  that  big  cars  tend  to  have  big  engines, 
a fact most are prepared to believe. Quite striking however is
the  negative  slope  going  from  low  weight  to  moderate  cubic 
inch  displacement.  This  is  clearly  an  outlier  which  is  unusual 
in  neither  variable  but  in  their  joint  relationship. 

The  relationship  between  miles  per  gallon  and  price  is 
also  perhaps  worthy  of  comment.  The  left-hand  side  shows  an 
approximate  hyperbolic  boundary  while  the  right-hand  side 
clearly illustrates the cross-over effect. This suggests that for
inexpensive  cars  or  poor  mileage  cars  there  is  relatively  little 
correlation.  However,  costly  cars  almost  always  get  relatively 
poor  mileage  while  good  gas  mileage  cars  are  almost  always 
relatively  inexpensive. 


Figure  5.2  The  second  permutation  of  the  five  dimensional 
presentation  of  the  automobile  data.  Notice  the  two  classes  of 
linear relations between gear ratio and miles per gallon.


Turning to Figure 5.2, the relationship between gear
ratio and miles per gallon is instructive. This diagram is
suggestive of two classes. Notice that there are a number of
observations represented by line segments tilted slightly to the
right of vertical (high positive slope) and a somewhat larger
number with a negative slope of about -1. Within each of
these two classes we have approximate parallelism. This
suggests that the relationship between gear ratios and miles per
gallon is approximately linear, a believable conjecture since low
gears = big engines = poor mileage while high gears = small
engines = good mileage. What is intriguing, however, is that
there seems to be really two distinct classes of automobiles each
exhibiting a linear relationship, but with different linear
relationships within each class.

Indeed in Figure 5.3, the third permutation, we are
able to highlight this separation into two classes in a truly
5-dimensional sense. The shaded region in Figure 5.3 describes a
class of vehicles with relatively poor gas mileage, relatively
heavy, relatively inexpensive, relatively large engines and
relatively low gear ratios. Figure 5.4 is a repeat of this graphic
but with different shading highlighting a class of vehicles with
relatively good gas mileage, relatively light weight, relatively
inexpensive, relatively small engines and relatively high gear
ratios. In 1980, these two characterizations describe
respectively domestic automobiles and imported automobiles.

Figure 5.3 The third permutation of the five dimensional
automobile data. Note the highlighting of the domestic
automobile group.

6. Graphical Extensions of Parallel Coordinate Plots.
The  basic  parallel  coordinate  idea  suggests  some  additional 
plotting  devices.  We  call  these  respectively  the  Parallel 
Coordinate  Density  Plots,  Relative  Slope  Plots  and  Color 
Histograms.  These  are  extensions  of  the  basic  idea  of  parallel 
coordinates,  but  structured  to  exploit  additional  features  or  to 
convey  certain  information  more  easily. 


Figure  5.4  The  third  permutation  showing  highlighting  of  the 
imported  automobile  group. 


6.1 Parallel Coordinate Density Plots. While the basic
parallel coordinate plot is a useful device itself, like the
conventional scatter diagram, it suffers from heavy overplotting
with large data sets. In order to get around this problem, we
use a parallel coordinate density plot which is computed as
follows. Our algorithm is based on the Scott (1985) notion of
average shifted histogram (ASH) but adapted to the parallel
coordinate context. As with an ordinary two dimensional
histogram, we decide on appropriate rectangular bins. A
potential difficulty arises because a line segment representing a
point may appear in two or more bins in the same horizontal
slice. Obviously if we have k n-dimensional observations, we
would like to form a histogram based on k entries. However,
since the line segment could appear in two or more bins in a
horizontal slice, the count for any given horizontal slice is at
least k and may be bigger. Moreover, every horizontal slice
may not have the same count. To get around this, we convert
line segments to points by intersecting each line segment with a
horizontal line passing through the middle of the bin. This
gives us an exact count of k for each horizontal slice. We
construct an ASH for each horizontal slice (typically averaging
5 histograms to form our ASH). We have used contours to
represent the two-dimensional density although gray scale
shading could be used in a display with sufficient bit-plane
memory. Because of our inability to reproduce color or
gray-scale, we cannot give an example of a parallel coordinate
density plot in this paper. Parallel coordinate density plots have
the advantage of being graphical representations of data sets
which are simultaneously high dimensional and very large.
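
The slice-by-slice construction described above can be sketched in Python. This is only an illustration under stated assumptions, not the authors' PASCAL implementation: the two adjacent axes are taken to be horizontal at heights 0 and 1, both variables are assumed to be rescaled to [0, 1], and the function names `ash_1d` and `pc_density` are hypothetical. Each line segment is intersected with the mid-line of every horizontal slice, so each slice receives exactly k points, and an ASH built from m shifted histograms is computed for each slice.

```python
import numpy as np

def ash_1d(points, lo, hi, n_bins, m=5):
    """Average of m shifted histograms (ASH) of `points` on [lo, hi].

    Returns a density estimate on a fine grid of n_bins * m cells.
    """
    h = (hi - lo) / n_bins            # width of one histogram bin
    delta = h / m                     # width of one fine-grid cell
    n_fine = n_bins * m
    idx = ((np.asarray(points, float) - lo) / delta).astype(int)
    idx = np.clip(idx, 0, n_fine - 1)
    fine = np.bincount(idx, minlength=n_fine).astype(float)
    # Averaging m shifted histograms is equivalent to smoothing the
    # fine-grid counts with triangular weights 1 - |i|/m.
    w = 1.0 - np.abs(np.arange(-(m - 1), m)) / m
    return np.convolve(fine, w, mode="same") / (len(points) * h)

def pc_density(lower, upper, n_rows=8, n_bins=10, m=5):
    """Density between two adjacent parallel axes (values scaled to [0, 1]).

    lower[i], upper[i] are the coordinates of observation i on the two axes.
    """
    lower = np.asarray(lower, float)
    upper = np.asarray(upper, float)
    rows = []
    for r in range(n_rows):
        t = (r + 0.5) / n_rows               # height of the slice mid-line
        x = lower + t * (upper - lower)      # one intersection per segment
        rows.append(ash_1d(x, 0.0, 1.0, n_bins, m))
    return np.array(rows)                    # shape (n_rows, n_bins * m)
```

The resulting array could be passed to any contouring or gray-scale routine; the choice of 8 slices and 10 bins is arbitrary.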

6.2 Relative Slope Plots. We have already seen that
parallel line segments in a parallel coordinate plot correspond to
high positive correlation (linear relationship). As in our
automobile example, it is possible for two or more sets of linear
relationships to exist simultaneously. In an ordinary parallel
coordinate plot, we see these as sets of parallel lines with
distinct slopes. The work of Cleveland and McGill (1984b)
suggests that comparison of slopes (angles) is a relatively
inaccurate judgement task and that it is much easier to
compare magnitudes on the same scale. The relative slope plot
is motivated by this. In an n-dimensional relative slope plot
there are n-1 parallel axes, each corresponding to a pair of
axes, say $x_i$ and $x_j$, with $x_j$ regarded as the lower of the two
coordinate axes. For each observation, the slope of the line
segment between the pair of axes is plotted as a magnitude
between -1 and +1. The maximum positive slope is coded as
+1, the minimum negative slope as -1 and a slope of $\infty$ as 0.
The magnitude is calculated as $\cos \eta$ where $\eta$ is the angle
between the $x_j$ axis and the line segment corresponding to the
observation. Each individual observation in the relative slope
plot corresponds to a vertical section through the axis system.
An example of a relative slope plot is given in Figure 6.1.
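
As a small worked illustration of the $\cos \eta$ coding, the following Python sketch computes the relative slope for one observation. It assumes the two axes are horizontal and separated by a vertical distance `gap` (an assumed parameter, set to 1 here); the function name is hypothetical. A vertical segment (infinite slope) gives 0, and segments leaning right or left give positive or negative values.

```python
import math

def relative_slope(x_lower, x_upper, gap=1.0):
    """cos(eta), where eta is the angle between the lower axis and the
    segment joining x_lower (on the lower axis) to x_upper (on the upper)."""
    dx = x_upper - x_lower
    return dx / math.hypot(dx, gap)
```

With this coding, observations that share a common linear relationship between the two variables tend to fall at similar heights, which is what makes the straightedge comparison possible.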

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Figure  6.1  Relative  slope  plot  of  five  dimensional  automobile 
data.  Data  presented  in  the  same  order  as  in  Figure  5.4 


Notice that since slopes are coded as heights, simply laying a
straightedge will allow us to discover sets of linear relationships
within the pair of variables $x_i$ and $x_j$.

6.3 Color Histograms. The basic set-up for the color
histogram is similar to the relative slope plots. For an
n-dimensional data set, there are n parallel axes. A vertical
section through the diagram corresponds to an observation.
The idea is to code the magnitude of an observation along a
given axis by a color bin, the colors being chosen to form a
color gradient. We typically choose 8 to 15 colors. The
diagram is drawn by choosing an axis, say $x_k$, and sorting the
observations in ascending order. Along this axis, we see blocks
of color arranged according to the color gradient with the width
of the block being proportional to the number of observations
falling into the color bin. The observations on the other axes
are arranged in the order corresponding to the $x_k$ axis and color
coded according to their magnitude. Of course, if the same
color gradient shows up say on the $x_m$ axis as on the $x_k$, then
we know $x_k$ is positively "correlated" with $x_m$. If the color
gradient is reversed, we know the "correlation" is negative. We
use the phrase "correlation" advisedly since in fact if the color
gradient is the same but the color block sizes are different, the
relationship is nonlinear. Of course if the $x_m$ axis shows color
speckle, there is no "correlation" and $x_k$ is unrelated to $x_m$.
Again we are unable to give an example of a color histogram in
this paper because of our inability to reproduce color or
gray-scale.
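
The layout of a color histogram can be sketched as a table of color-bin indices. The Python below is an illustration only: equal-width color bins are an assumption (the text does not say how bin boundaries are chosen), and `color_histogram` is a hypothetical name. Each returned row corresponds to one axis, and each column to one observation in the order induced by the sorting axis.

```python
def color_histogram(data, sort_axis, n_colors=8):
    """data: list of observations, each a sequence of n values.

    Returns one row of color-bin indices (0 .. n_colors-1) per axis, with
    observations arranged in ascending order of the chosen sort axis.
    """
    order = sorted(range(len(data)), key=lambda i: data[i][sort_axis])
    rows = []
    for j in range(len(data[0])):
        column = [obs[j] for obs in data]
        lo, hi = min(column), max(column)
        width = (hi - lo) / n_colors or 1.0   # guard against a constant axis
        rows.append([min(int((data[i][j] - lo) / width), n_colors - 1)
                     for i in order])
    return rows
```

A row that repeats the gradient of the sorting axis indicates a positive relationship, a reversed gradient indicates a negative one, and a row with no ordering corresponds to the color speckle described above.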

7. Implementations and Experiences. Our parallel
coordinates data analysis software has been implemented in two
forms. The first is a PASCAL program operating on the IBM RT
under the AIX operating system. This code allows for up to
four  simultaneous  windows  and  offers  simultaneous  display  of 
parallel  coordinates  and  scatter  diagram  displays.  It  offers 
highlighting,  zooming  and  other  similar  features  and  also  allows 
the  possibility  of  nonlinear  rescaling  of  each  axis.  It 
incorporates  axes  permutations  and  also  includes  Parallel 
Coordinate  Density  Plots,  Relative  Slope  Plots  and  Color 
Histograms. 

Our  second  implementation  is  under  development  in 
PASCAL  for  MS-DOS  machines  and  includes  similar  features. 
In addition, it has a mouse-driven painting capability and can
do real-time rotation of 3-dimensional scatterplots. Both
programs  use  EGA  graphics  standards,  with  the  second  also 
using  VGA  or  Hercules  monochrome  standards. 

We  regard  the  parallel  coordinate  representation  as  a 
device  complementary  to  scatterplots.  A  major  advantage  of 
the  parallel  coordinate  representation  over  the  scatterplot 
matrix  is  the  linkage  provided  by  connecting  points  on  the  axes. 
This  linkage  is  difficult  to  duplicate  in  the  scatterplot  matrix. 
Because  of  the  projective  line-point  duality,  the  structures  seen 
in  a  scatterplot  can  also  be  seen  in  a  parallel  coordinate  plot. 
Moreover,  the  work  of  Cleveland  and  McGill  (1984b)  suggests 
that  it  is  easier  and  more  accurate  to  compare  observations  on 
a  common  scale.  The  parallel  coordinate  plot  and  the 
derivatives  of  it  de  facto  have  a  common  scale  and  so  for 
example  a  sense  of  variability  and  central  tendency  among  the 
variables  are  easier  to  grasp  visually  in  parallel  coordinates 
when  compared  with  the  scatterplot  matrix.  On  the  other 
hand,  one  might  interpret  all  the  ink  generated  by  the  lines  as  a 
significant  disadvantage  of  the  parallel  coordinate  plot.  Our 
experience  on  this  is  mixed.  Certainly  for  large  data  sets  on 
hard  copy  this  is  a  problem.  When  viewed  on  an  interactive 
graphics  screen  particularly  a  high  resolution  screen,  we  have 
often found that individual points in a scatterplot can get lost
because they are simply not bright enough. That does not
happen in a parallel coordinate plot. However, if many points
are plotted in monochrome, it is hard to distinguish between
points. We have gotten around this problem by plotting
distinct points in different colors. In an EGA implementation,
this means 16 colors. This is surprisingly effective in separating
points. In one experiment, we plotted 5000 5-dimensional
random vectors using 16 colors, and in spite of total
overplotting, we were still able to see some structure. In data
sets of somewhat smaller scale, we have implemented a
scintillation technique. With this technique, when there is
overplotting we cause the screen view to scintillate between the
colors representing the overplotted points. The speed of
scintillation is proportional to the number of points
overplotted and by carefully tracing colors, one can follow an
individual point through the entire diagram.

We have found painting to be an extraordinarily
effective technique in parallel coordinates. We have a painting
scheme that not only paints all lines within a given rectangular
area, but also all lines lying between two slope constraints. This
is very effective in separating clusters. We also use invisible
paint to eliminate observation points from the data set
temporarily. This is a natural way of doing a subset selection.
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GRAPHICAL  REPRESENTATIONS  OF  MAIN  EFFECTS  AND  INTERACTION 
EFFECTS  IN  A  POLYNOMIAL  REGRESSION  ON  SEVERAL  PREDICTORS 

William  DuMouchel,  BBN  Software  Products  Corporation 
Abstract


The  table  of  coefficients  from  a  polynomial  regression 
analysis  having  several  predictors  is  hard  to  interpret  because 
its  focus  is  on  the  terms  in  the  fitted  equation,  rather  than  on 
the  variables  used  to  define  those  terms.  Methods  for 
graphically  comparing  the  effects  of  each  predictor  to  each 
other  and  to  the  residuals  are  introduced  and  discussed.  The 
techniques  are  easy  to  implement  and  to  interpret,  and  have 
been  generalized  to  provide  graphical  summaries  of  interaction 
effects. 

1.  Introduction 

Partial  residual  plots  (also  known  as  component-plus- 
residual  plots)  are  useful  diagnostic  tools  in  multiple 
regression  analysis.  Mallows  (1986)  discusses  them  and 
suggests  an  extension  of  the  technique,  which  he  calls  an 
augmented  partial  residual  plot,  designed  to  reveal  a  nonlinear 
effect  in  a  regression  model.  This  paper  introduces 
generalizations  of  such  plots  which  are  designed  to  help  a  data 
analyst  interpret  the  fit  to  an  arbitrary  response  surface  model 
(RSM),  a  regression  equation  in  the  form  of  a  polynomial  in 
several  variables.  This  new  technique,  called  an  adjusted-Y 
plot,  can  also  be  used  to  help  diagnose  nonlinearity  of  a 
regression  function  with  respect  to  one  of  the  predictors,  and 
in  fact,  if  the  regression  model  being  fitted  is  additive  and 
linear  in  the  predictors,  the  adjusted-Y  plot  reduces  to  the 
partial residual plot. However, the adjusted-Y plot is useful
for  an  arbitrary  polynomial  RSM,  and  the  emphasis  of  this 
technique  is  not  so  much  to  diagnose  nonlinearity  as  to 
visualize  the  nonlinearity  which  has  already  been  incorporated 
into  the  RSM,  with  a  secondary  goal  of  diagnosing  deviations 
from  the  assumptions  of  the  RSM.  The  adjusted-Y  plot  is 
especially  useful  as  the  foundation  for  other  graphical 
techniques  for  comparing  the  effects  of  the  different  predictor 
variables  in  the  RSM,  and  for  helping  the  data  analyst  visualize 
the  size  and  significance  of  interaction  effects. 

Partial residuals. Suppose that a linear regression
model with J predictors, of the form

$$y_i = b_0 + b_1 x_{i1} + b_2 x_{i2} + \cdots + b_J x_{iJ} + e_i, \quad i = 1, \ldots, n,$$

has been fit by least squares, where the b's are the estimated
coefficients and the e's are the residuals. Suppose it is desired
to focus on one of the predictors, say $x_1 = (x_{i1};\ i = 1, \ldots, n)$,
and check the assumptions of constant variance and linearity
with respect to that predictor. The partial residuals (pr) with
respect to $x_1$ are defined as

$$pr_{i1} = \bar{y} + b_1 (x_{i1} - \bar{x}_1) + e_i, \quad i = 1, \ldots, n,$$

where $\bar{y}$ and $\bar{x}_1$ are means, and $b_1$ and the $e_i$ are taken from
the full regression. The plot of $pr_{i1}$ vs $x_{i1}$ has the advantage
that it displays both the signal coming from $x_1$ (the term
$b_1 (x_{i1} - \bar{x}_1)$) and the noise (the term $e_i$) as they occur in the
regression on all J predictors. This plot is to be distinguished
from the "added variable" or "partial regression" plot, which is
similar to the partial residual plot except that the values plotted
on the horizontal axis are the residuals $(x_{i1} - \hat{x}_{i1})$, based on a
regression of $x_1$ on the remaining J-1 predictors, rather than
the $x_{i1}$ themselves.
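
The definition of the partial residuals translates directly into a few lines of Python. The sketch below assumes NumPy is available and uses a hypothetical function name; it fits the full regression by least squares and then forms $\bar{y} + b_j (x_{ij} - \bar{x}_j) + e_i$ for a chosen predictor.

```python
import numpy as np

def partial_residuals(X, y, j=0):
    """Partial residuals with respect to column j of X (n rows, J columns).

    Returns (pr, b), where b holds the intercept followed by the J slopes.
    """
    n = len(y)
    D = np.column_stack([np.ones(n), X])        # design matrix with intercept
    b, *_ = np.linalg.lstsq(D, y, rcond=None)   # least squares coefficients
    e = y - D @ b                               # residuals of the full fit
    xj = X[:, j]
    pr = y.mean() + b[j + 1] * (xj - xj.mean()) + e
    return pr, b
```

Because the residuals are orthogonal to every predictor, a simple regression of `pr` on column j recovers the same slope as the full regression, which is a convenient check on the computation.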

Augmented partial residuals. Mallows (1986)
suggested that nonlinearity in the relationship between y and
$x_1$ can be better detected by adding $(x_{i1} - \bar{x}_1)^2$ to the regression
equation and then replacing $pr_{i1}$ by

$$apr_{i1} = \bar{y} + b_1 (x_{i1} - \bar{x}_1) + c\,[(x_{i1} - \bar{x}_1)^2 - ave] + e_i,$$

where $b_1$ and c are coefficients and the e's are residuals from
the augmented regression model, and where ave is the average
of $(x_{i1} - \bar{x}_1)^2$ in the sample. The augmented partial residual
plot is most effective, compared to the simple partial residual
plot, when one or more of the other predictors are correlated
with the term $(x_{i1} - \bar{x}_1)^2$.

Adjusted-Y plots. Suppose that a response surface
model equation is represented as

$$y_i = F(x_{i1}, x_{i2}, \ldots, x_{iJ}) + e_i,$$

where F is the fitted polynomial and the e's are the residuals
from the regression. For any one of the predictors, say $x_1$,
define an adjusted-fit function over the range of $x_1$ as

$$f_1(x) = \frac{1}{n} \sum_{k=1}^{n} F(x, x_{k2}, \ldots, x_{kJ}), \qquad (1)$$

and define an adjusted-Y variable for the i-th observation as

$$y^*_{i1} = f_1(x_{i1}) + e_i. \qquad (2)$$

As proved in Section 5.2, if F is of the form $b_0 + b_1 x_1 +
F^*(x_2, \ldots, x_J)$, then every $y^*_{i1} = pr_{i1}$. Also, if F is of the
form $b_0 + b_1 x_1 + b_{11} x_1^2 + F^*(x_2, \ldots, x_J)$, then every
$y^*_{i1} = apr_{i1}$. The adjusted-Y plot is a generalization of the partial
residual and the augmented partial residual plots which is
useful for response surface models having arbitrary power and
interaction terms.
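
A minimal Python sketch of definitions (1) and (2) follows, assuming NumPy. The names `rsm_terms`, `adjusted_fit` and `adjusted_y` are hypothetical, and the particular polynomial in `rsm_terms` is only an example of a model with power and interaction terms. The adjusted fit at x is obtained by setting predictor j to x in every row of the data and averaging the fitted values.

```python
import numpy as np

def rsm_terms(X):
    """Example RSM terms for two predictors: 1, x1, x2, x1*x2, x2^2."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([np.ones(len(X)), x1, x2, x1 * x2, x2 ** 2])

def adjusted_fit(b, X, j, x, terms=rsm_terms):
    """f_j(x): average fitted value with predictor j fixed at x, eq. (1)."""
    Xs = np.array(X, dtype=float)    # copy so the data are not modified
    Xs[:, j] = x
    return (terms(Xs) @ b).mean()

def adjusted_y(X, y, j, terms=rsm_terms):
    """Adjusted-Y values f_j(x_ij) + e_i, eq. (2). Returns (values, b, e)."""
    T = terms(X)
    b, *_ = np.linalg.lstsq(T, y, rcond=None)
    e = y - T @ b
    fit = np.array([adjusted_fit(b, X, j, x, terms) for x in X[:, j]])
    return fit + e, b, e
```

Passing a purely additive linear `terms` function reproduces the partial residuals, in line with the reduction stated above.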

2.  Example  Usage  of  the  Adjusted-Y  Plot 

2.1  Data  and  Standard  Analysis 

The  data  used  in  this  example  were  taken  from  Andrews 
and  Herzberg  (1985,  p. 355-6)  and  consist  of  measurements 
on  42  apple  trees  in  an  agricultural  experiment.  The  response 
is  the  mean  weight  (Wt)  of  mature  apples  on  each  tree,  which 
was  considered  to  be  a  proxy  for  the  relative  freedom  of  the 
apples  from  a  disease  which  shrivels  them.  Several  variables 
were  measured  and  reported  by  Andrews  and  Herzberg  (1985) 
and  by  the  original  researchers  Ratkowsky  and  Martin  (1974), 
but  this  example  will  use  just  three  of  them,  the 
concentrations,  in  parts  per  million,  of  three  minerals  in  the 
apples  of  each  tree.  The  three  concentrations,  labeled  K,  Pn, 
and  Ca  respectively,  were  used  to  form  a  response  surface 
model  to  describe  the  association  between  the  mineral 
concentrations  and  the  mean  weight  variable.  After  some 
preliminary modeling, an equation of the form

$$Wt = b_0 + b_1\,Pn + b_2\,K + b_3\,Ca + b_4\,Pn*K + b_5\,Pn*Ca + b_6\,K^2 + e \qquad (3)$$

was  considered  adequate  for  describing  the  data.  Figure  1 
shows  the  table  of  coefficients  and  some  related  statistics  for 
this  regression  model.  Figure  2  shows  a  scatterplot  of  the 
absolute  values  of  the  studentized  residuals  versus  the  fitted 
values  from  this  regression,  with  a  lowess  fit  to  the  points 
showing  no  pattern  indicating  a  violation  of  the  usual 
assumptions  of  regression  models.  Figure  3  shows  a 
scatterplot  of  Wt  versus  K,  with  three  different  symbols  used 
to  denote  points  falling  within  three  ranges  (low,  medium,  and 
high)  of  the  variable  Pn. 
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Figure  1.  Table  of  coefficients  and  related  output  for  the  example  data. 


Figure  2.  Absolute  residual  plot  with  lowess  curve  for  the  example  data. 


Comparing  Figure  1  with  Figure  3,  the  difficulty  of 
interpreting  the  table  of  coefficients  from  a  response  surface 
model  becomes  evident.  Although  the  scatterplot  of  the  raw 
data  in  Figure  3  seems  to  show  a  definite  relationship  between 
Wt  and  K,  none  of  the  three  terms  in  the  table  of  coefficients 
that  contain  K  as  a  factor  has  a  significant  coefficient,  although 
the  term  Pn*K  is  borderline.  In  fact,  a  casual  glance  at  the 
table  of  coefficients  is  not  enough  to  confirm  that  the  fitted 
value  of  Wt  increases  with  K,  since  complicated  comparisons 
of  the  relative  contributions  of  the  linear,  quadratic  and 
interaction  terms  are  required.  If  all  three  mineral 
concentrations  had  been  standardized  to  have  mean  0  and 
variance  1,  the  task  of  sorting  out  the  effects  of  each  mineral 
from  the  table  of  coefficients  would  be  somewhat  eased,  but 


correlations  among  the  minerals  can  prevent  an  easy 
interpretation  no  matter  how  they  are  scaled. 


Figure  3.  Scatterplot  of  Wt  vs.  K,  coded  by  range  of  Pn. 


A  closer  look  at  Figure  3  shows  that  the  predictors  K  and 
Pn  are  indeed  correlated,  since  the  pattern  of  symbols  denoting 
approximate  values  of  Pn  on  the  plot  shows  that  low  and  high 
values  of  K  tend  to  be  associated  with  low  and  high  values  of 
Pn,  respectively.  So  there  is  ambiguity  in  Figure  3;  the 
apparent  trend  of  Wt  with  K  could  be  due  partially  to 
confounding  with  the  effect  of  Pn,  and  the  apparent  linearity  of 
the  trend  could  also  be  an  artifact  of  the  confounding.  The 
scatter  of  the  points  in  Figure  3  about  this  trend  is  also 
ambiguous,  since  it  is  due  partly  to  the  error  term  from  the 
regression  and  partly  to  the  effects  of  the  other  two  predictors. 


Figure 4. Contour plot of part of the fitted response surface. Plotted
points are locations of raw data.


Figure  4  shows  a  contour  plot  of  the  fitted  surface  versus 
K  and  Pn  at  the  point  Ca=200.  Contour  plots  are  frequently 
used  to  study  fitted  response  surfaces,  but  they  have  some 
limitations.  Many  people  without  a  technical  background  find 
contour  plots  more  difficult  to  interpret  than  the  basic  X-Y 
plot.  There  is  no  measure  of  uncertainty  on  the  standard 
contour  plot:  no  residuals  to  show  where  there  might  be  lack 
of  fit  to  the  model,  and  no  error  bars  to  show  the  magnitude  of 
the  sampling  error  inherent  in  the  contours.  Each  contour  plot 
must  fix  all  but  two  of  the  predictors,  so  only  a  small  slice  of 
the  design  space  is  portrayed  on  contour  plots  of  models 
having  several  predictor  variables. 
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2.2  Adjusted-Y  Plots 

Figures 5 and 6 show the adjusted-Y plots for the variables
Pn and K, respectively. Figure 5 shows the average (over the
n=42 sample points) of fitted Wt versus Pn, namely the
straight line f1(Pn), with the residuals from the response
surface fit added to form the ordinates, the values of
adjusted-Wt defined by (2). Figure 6 shows the average of fitted Wt
versus K, namely the parabola f2(K), with the same residuals
added to form the second set of adjusted-Wts. It is instructive
to compare Figure 6 with Figure 3. The ambiguities of Figure
3 have been cleared up in Figure 6. The dependence of Wt on
K is portrayed in Figure 6 clear of any confusion with the
effects of Pn or Ca. The adjusted-Y plot communicates the
magnitude of the curvature and the strength and direction of
the overall trend more powerfully than does the table of
coefficients. Figure 6 also makes possible a comparison of the
relative magnitudes of the variation due to K versus the
unexplained variation in Wt. Figure 3 is misleading in this
comparison.


Figure 6. Adjusted-fit curve and adjusted-Y points for the predictor K.


Since  the  curve  in  Figure  6,  f2(K),  is  the  average  of  n=42 
parabolas,  and  since  the  model  does  contain  an  interaction  term 
between  K  and  Pn,  it  is  possible  that  for  some  values  of  Pn  the 
behavior  of  the  fitted  function  will  be  quite  different  from  that 
of f2(K). But the average behavior, at least, is easily
visualized, standardized to the distribution of the other two
predictors in the sample. And any K-regions of lack of fit of
the points to the model are easily identified. In Section 4 a
method  of  displaying  the  interaction  effects  in  response  surface 
models  is  described. 


3.  Standard  Errors  for  Adjusted-Fits  and 
for  Average  Effects 

3.1  Development  of  Standard  Errors. 

We now shift attention from the adjusted-Y values, which
are interpreted much like partial residuals, to the adjusted-fit
curves. If the fitted response surface F is linear and additive in
every predictor, and contains a constant term, then the j-th
adjusted-fit curve is just

$$f_j(x) = \bar{y} + b_j (x - \bar{x}_j),$$

where $b_j$ is the coefficient of $x_j$ in the multiple regression. In
this case the variance of $f_j(x)$ would be estimated by

$$\frac{mse}{n} + V(b_j)\,(x - \bar{x}_j)^2,$$

where mse is the mean squared error of the residuals and $V(b_j)$
is the usual estimate of the variance of the regression
coefficient, based on the inverse of the X'X matrix from the
regression.

In the general case of a polynomial response surface in J
variables, the j-th adjusted-fit curve is a polynomial of degree $p_j$,

$$f_j(x) = B_{j0} + B_{j1} x + \cdots + B_{j p_j} x^{p_j},$$

where $p_j$ is the largest power of $x_j$ which occurs in F. The
coefficients $B_{jk}$ are linear combinations of the b's, the
coefficients of F. For example, using the RSM of (3),

$$f_2(K) = B_{20} + B_{21} K + B_{22} K^2,$$

$$B_{20} = b_0 + b_1 \overline{Pn} + b_3 \overline{Ca} + b_5 \overline{Pn*Ca},$$

$$B_{21} = b_2 + b_4 \overline{Pn},$$

$$B_{22} = b_6,$$

where the constants $\overline{Pn}$, $\overline{Ca}$ and $\overline{Pn*Ca}$ are averages of the
three corresponding terms over the n sample points. Thus, if
b is the vector of least squares coefficients and if $B_j$ is the
vector $(B_{j0}, B_{j1}, \ldots)'$, then there is a matrix $A_j$, with elements
formed from averages of predictor variable terms, which
transforms b to $B_j$:

$$B_j = A_j b,$$

$$C_j = \text{estimated covariance matrix of } B_j = A_j (X'X)^{-1} A_j' \; mse.$$
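
The matrix $A_j$ for the predictor K in model (3) can be written out explicitly. The Python sketch below assumes NumPy, uses hypothetical function names, and is meant to be run on any three predictor arrays and a response; it does not reproduce the apple data. The columns of the design matrix follow the term order of (3), so $b_0, \ldots, b_6$ correspond to 1, Pn, K, Ca, Pn*K, Pn*Ca and K^2.

```python
import numpy as np

def design(Pn, K, Ca):
    """Design matrix for model (3): 1, Pn, K, Ca, Pn*K, Pn*Ca, K^2."""
    return np.column_stack([np.ones_like(Pn), Pn, K, Ca,
                            Pn * K, Pn * Ca, K ** 2])

def adjusted_fit_coefs_K(Pn, K, Ca, y):
    """Return (b, B, C): RSM coefficients, the coefficients (B20, B21, B22)
    of the adjusted-fit parabola in K, and their estimated covariance."""
    X = design(Pn, K, Ca)
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ X.T @ y
    mse = np.sum((y - X @ b) ** 2) / (n - p)
    A = np.zeros((3, p))
    # B20 = b0 + b1*mean(Pn) + b3*mean(Ca) + b5*mean(Pn*Ca)
    A[0, [0, 1, 3, 5]] = [1.0, Pn.mean(), Ca.mean(), (Pn * Ca).mean()]
    # B21 = b2 + b4*mean(Pn)
    A[1, [2, 4]] = [1.0, Pn.mean()]
    # B22 = b6
    A[2, 6] = 1.0
    B = A @ b
    C = A @ XtX_inv @ A.T * mse
    return b, B, C
```

Evaluating the parabola with coefficients `B` at any K gives the same value as averaging the fitted surface over the sample with K held fixed, which is the defining property of the adjusted-fit curve.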

3.2  Confidence  Intervals  for  Adjusted  Effects  of 
Variables 

Once  the  covariance  matrix  of  each  Bj  is  available,  it  is 
easy  to  obtain  standard  errors  and  confidence  intervals  for  the 
functions  fj(x)  at  any  point  x.  The  curves  in  Figures  5  and  6 
could  have  error  bars  or  even  upper  and  lower  confidence 
curves  drawn  about  them  on  the  figures.  Such  confidence 
intervals  may  not  often  be  useful,  since  the  height  of  the 
adjusted  fit  curve  at  any  point  is  not  a  predicted  response  at 
any  particular  design  point,  but  is  instead  an  average  of 
predictions at n design points. A more useful application of
these covariances is for the computation of confidence
intervals for contrasts based on the difference of two values,
$f_j(x) - f_j(x')$.

As an example, look at the adjusted-fit curve in Figure 6,
f2(K). Within the range of the data the minimum value of f2
is f2(7240) = 51, and the maximum value of f2 is
f2(12910) = 116. So 116 - 51 = 65 is the estimated increase in
Wt when K changes from 7240 to 12910, adjusted for all the
other predictors. A confidence interval about this difference is
derived as follows:

Define $\mathbf{x}$ as the row vector $(1, x, x^2, \ldots)$, and define $\mathbf{x}'$
analogously. Then

$$f_j(x) - f_j(x') = (\mathbf{x} - \mathbf{x}')\, B_j,$$

$$V_j = \text{estimated variance of } [f_j(x) - f_j(x')] = (\mathbf{x} - \mathbf{x}')\, C_j\, (\mathbf{x} - \mathbf{x}')^T.$$

Therefore a confidence interval for the increase in mean
response associated with changing the j-th variable from x' to x
is

$$f_j(x) - f_j(x') \pm t(df, 1 - \alpha/2)\, \sqrt{V_j}, \qquad (4)$$

where $t(df, 1 - \alpha/2)$ is a tabled Student's t percentile with
degrees of freedom equal to the degrees of freedom of mse.
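
Interval (4) is straightforward to compute once $B_j$ and $C_j$ are in hand. The following Python sketch assumes NumPy and takes the tabled t percentile as an argument (`t_crit`, a hypothetical parameter name) rather than looking it up; the function name is also hypothetical.

```python
import numpy as np

def contrast_interval(B, C, x, x_prime, t_crit):
    """Confidence interval (4) for f_j(x) - f_j(x').

    B: coefficients (B_j0, B_j1, ...) of the adjusted-fit polynomial.
    C: their estimated covariance matrix.
    t_crit: tabled Student's t percentile for the df of mse.
    """
    B = np.asarray(B, float)
    C = np.asarray(C, float)
    powers = np.arange(len(B))
    # Difference of the row vectors (1, x, x^2, ...) and (1, x', x'^2, ...)
    d = float(x) ** powers - float(x_prime) ** powers
    estimate = d @ B
    half_width = t_crit * np.sqrt(d @ C @ d)
    return estimate - half_width, estimate + half_width
```

Note that the first element of the difference vector is always zero, so the constant term and its variance drop out of the contrast.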

Figure  7  graphs  these  confidence  intervals  for  the  effects 
of  the  three  predictor  variables  in  the  RSM  for  the  apple  data. 
In  each  case,  the  values  of  x  and  x'  used  are  the  maximum  and 
minimum  values,  respectively,  of  the  predictor  in  the  sample. 
(Section  5.1  provides  the  rules  for  choosing  x  and  x'  in 
general,  and  also  discusses  the  choice  of  tabled  percentile  for 
the  width  of  the  interval.) 


Figure 7. Effects graph based on the adjusted-fit curves.


Compare  Figure  7  with  Figure  1,  the  table  of  coefficients, 
as  summaries  of  the  RSM  analysis.  The  information  in  Figure 
7  is  tremendously  more  accessible.  You  can  see  at  a  glance 
that  Pn  has  a  negative  effect  of  about  55  grams,  K  has  a 
positive  effect  of  about  65  grams,  Ca  has  a  negative  effect  of 
about  25  grams,  and  that  all  three  effects  are  statistically 
significant. (The word "effect" as used here is not intended to
imply a causal effect, merely an associated change in mean
response.) The table of coefficients is quite opaque by
comparison. It is practically impossible to tell which variables
have  effects  in  which  directions  without  elaborate  calculation, 
much  less  gauge  the  relative  significance  of  the  three 
predictors.  The  problem  with  the  table  of  coefficients  as  a 


summary  of  the  analysis  is  that  it  focuses  on  the  terms  of  the 
model,  not  on  the  variables  of  the  model. 

Figure  4,  a  contour  plot  of  the  fitted  RSM,  although  it  does 
focus  on  the  variables,  is  still  much  less  effective  than  Figure  7 
as  a  summary  of  the  analysis.  Figure  4  gives  no  information 
about  the  effect  of  Ca,  and  no  information  about  the  statistical 
significance  of  any  of  the  effects.  And  it  is  just  plain  harder  to 
read. 

This  is  not  to  say  that  you  should  never  look  at  tables  of 
coefficients  or  contour  plots  of  RSM  fits,  just  that  the  graph  of 
effects  as  here  defined  is  a  valuable  addition  to  the  statistician's 
toolbox,  especially  in  conjunction  with  the  adjusted-Y  plots 
discussed previously and the interaction graphs discussed in
the  next  section. 

4.  Interaction  Graphs 

4.1  The  Bivariate  Adjusted-Fit  Function 

In  order  to  explore  the  interaction  inherent  in  a  fitted  RSM 
equation,  we  extend  the  definition  of  the  adjusted-fit  function 
of  (1)  to  the  bivariate  case.  Suppose  we  are  interested  in  the 
fitted  relationship  as  a  function  of  two  of  the  predictors,  say  X] 
and  X2,  after  adjusting  for  all  other  predictors.  As  before,  let 
F(xj,  X2,  X3, ...  ,  X;)  be  the  fitted  RSM  equation,  and  define 

fi2(x,  z)  =  F(x.  z,  Xk3 . Xkj). 

In  the  case  of  the  example  model  (3). 

f,2(Pn,  K)  =  (bo+  b3  Ca)  -b  (bi  -i-  bs  ea)Pn  -t-  b2  K  + 
b4Pn*K  -b  bfcK^  . 

As  in  the  case  of  the  univariate  adjusted-fit  function,  the 
coefficients of f12 are simple functions of the coefficients of F
and certain moments of the predictors which are being
averaged out. It is similarly straightforward to compute the
variance of f12(x, z) at any value of (x, z), or the variance of
any difference of the form [f12(x, z) - f12(x', z')] for any pairs
of values.
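The averaging in the bivariate adjusted-fit function can be sketched in a few lines of Python. This is only an illustration: the coefficient values, the three-row data set, and the function names are hypothetical, and the model form is the one implied by the example expression for f12(Pn, K) above. Because that model is linear in Ca, averaging over the sample rows should agree with substituting the sample mean of Ca.

```python
from statistics import fmean

def adjusted_fit_2d(F, data, x, z):
    """Average F(x, z, x_k3, ..., x_kJ) over the sample rows k = 1..n."""
    return fmean(F(x, z, *row[2:]) for row in data)

# Hypothetical coefficients b0..b6 (illustrative values only)
b = [10.0, 2.0, 1.0, 0.5, 0.3, -0.2, 0.05]

def F(pn, k, ca):
    """Fitted equation with terms 1, Pn, K, Ca, Pn*K, Pn*Ca, K^2."""
    return (b[0] + b[1] * pn + b[2] * k + b[3] * ca
            + b[4] * pn * k + b[5] * pn * ca + b[6] * k ** 2)

# Hypothetical sample rows (Pn, K, Ca)
data = [(1.0, 2.0, 3.0), (2.0, 1.0, 5.0), (3.0, 4.0, 4.0)]
ca_bar = fmean(row[2] for row in data)

def closed_form(pn, k):
    """Closed form of f12(Pn, K): Ca replaced by its sample mean."""
    return ((b[0] + b[3] * ca_bar) + (b[1] + b[5] * ca_bar) * pn
            + b[2] * k + b[4] * pn * k + b[6] * k ** 2)
```

If the model contained higher powers of Ca, the closed form would involve the corresponding higher sample moments of Ca rather than the mean alone, while the averaging function would remain unchanged.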

4.2  Displaying  Interaction  Effects 

Figure  8  shows  how  the  effect  of  an  interaction  term  in  a 
RSM can be displayed in a graph analogous to the effects
graph  of  Figure  7.  The  top  bar  in  Figure  8  repeats  the  top  bar 
in  Figure  7,  a  confidence  interval  for  the  effect  of  Pn,  namely 
f1(3280) - f1(1490). The next three bars in Figure 8 display
confidence intervals for f12(3280, K) - f12(1490, K), for three
values  of  K.  That  is,  the  same  contrast  in  Pn  is  repeated 
assuming  K  is  fixed,  for  various  values  of  K.  By  comparing 
these  three  intervals,  you  can  see  the  direction,  magnitude,  and 
significance  of  the  interaction  between  Pn  and  K  in  their  effect 
on  Wt,  as  measured  by  the  RSM.  Since  the  midpoints  of  the 
intervals  move  to  the  right  as  K  increases,  the  interaction  is 
positive.  The  magnitude  of  the  interaction  is  about  the  same  as 
the main effect of Pn, since at the largest value of K the effect
of  Pn  is  almost  exactly  0,  while  at  the  minimum  value  of  K  the 
effect  of  Pn  is  about  double  its  average  value.  And  the 
interaction  is  on  the  borderline  of  being  statistically  significant, 
since  the  confidence  intervals  for  the  effect  of  Pn  at  the  high 
and  low  values  of  K  barely  overlap.  (In  this  case  the 
judgement of statistical significance is merely approximate,
since the overlapping of the two confidence intervals does not
rule out finding a significant difference. But an approximate
indication of the sampling error is clearly communicated by the
graph.)

Figure 8. Interaction graph based on the bivariate adjusted-fit function.


The  bottom  three  confidence  intervals  in  Figure  8  provide 
the  dual  interpretation  of  the  interaction  between  Pn  and  K. 
First  the  main  effect  of  K,  measured  as  f2(12910)  -  f2(7240), 
is  graphed  exactly  as  in  the  middle  interval  of  Figure  7.  Below 
it are confidence intervals for [f12(Pn, 12910) - f12(Pn,
7240)],  for  the  extreme  values  of  Pn  in  the  sample. 
Comparison  of  these  intervals  leads  to  the  same  interpretation 
as  before,  but  with  emphasis  on  how  the  effect  of  K  changes 
as  a  function  of  Pn,  rather  than  vice-versa. 


As a device for visualizing interaction, Figure 8 has
advantages  over  Figure  4,  the  contour  plot.  In  order  to  figure 
out  the  direction  of  the  interaction  from  the  contour  plot,  you 
can  notice  that  the  contours  are  more  closely  spaced  in  the 
vertical  (K)  direction  where  Pn  is  large  than  where  Pn  is  small. 
This  indicates  that  K  has  a  greater  effect  when  Pn  is  large  than 
when  Pn  is  small.  But  perceiving  the  magnitude  of  the 
interaction  from  Figure  4  is  even  more  difficult,  while  there  is 
no  indication  at  all  of  statistical  significance. 

The  table  of  coefficients  in  Figure  1,  on  the  other  hand, 
does  display  the  direction  and  significance  of  the  Pn*K  term 
(p=.06), but the magnitude of the interaction effect compared
to the other effects is hard to see from the table alone. And if
the model were expanded to contain cubic terms like Pn*K^2,
then  even  the  significance  of  the  interaction  could  become 
obscured  in  the  table  by  the  correlations  between  the  various 
terms  of  the  model. 


One  frequently  recommended  method  for  visualizing  the 
interaction  between  two  factors  on  a  response  is  to  graph  the 
response  versus  one  of  the  factors  separately  for  different 
levels of the other factor. The scatterplot in Figure 3 is such a
plot,  but  the  plot  in  Figure  9  better  illustrates  the  idea  by 
overlaying smoothed lowess curves over each of the three sets
of  points.  The  curve  based  on  the  largest  values  of  Pn  is 
steepest,  confirming  the  interaction  effect  we  have  been 
studying.  This  method  of  displaying  interaction  is  particularly 
effective  if  the  data  come  from  a  two-level  orthogonal 
experimental  design,  since  the  plot  then  consists  of  just  a  pair 
of straight lines, and, if the design has resolution at least 5, the
interaction is not confounded with effects of other variables.
The interaction graph of Figure 8 complements that of Figure
9. Figure 8 is model-based while Figure 9 is an exploratory
graph  based  on  the  raw  data.  Figure  8  contains  more  precise 
information  on  the  extent  and  significance  of  the  interaction, 
while  Figure  9  displays  response  values  directly,  rather  than 
being  based  on  differences  of  responses. 


Figure  9.  Scatterplot  as  in  Figure  3,  with  lowess  curves  added. 

5.  Discussion 

This  section  discusses  several  issues  related  to  the 
implementation  of  these  methods,  and  concludes  with  a  proof 
that  the  adjusted-Y  plot  is  equivalent  to  the  partial  residual  plot 
when  the  model  is  additive  with  respect  to  the  selected 
predictor. 

5.1  Implementation  Issues 

In  order  to  create  the  special  plots  introduced  here,  a 
multiple  regression  program  must  have  data  structures  that 
enable the system to analyze each term of the RSM, and to
determine  which  variables  are  involved.  The  Mulreg  program, 
from  BBN  Software  Products  Corporation,  lets  the  user 
specify,  fit,  and  compare  different  models;  the  definition  of  a 
model  includes  a  list  of  terms  containing  such  information. 
Using this information, it is relatively simple to sort the terms
and compute the adjusted-fit functions by calculating the B's
from the b's and certain sample moments, as discussed in
Section 3.1. The confidence intervals for the effects and
interaction  graphs  can  then  be  computed  as  described  in 
Section  3.2.  The  following  paragraphs  discuss  the  rationale 
for  several  of  the  choices  which  the  system  makes  in  forming 
these  confidence  intervals. 

Choice  of  comparison  values  in  the  effects 
graph.  The  effects  graph  of  Figure  7  is  based  on  three 
adjusted-fit  curves,  one  for  each  predictor.  The  center  of  the 
j-th confidence interval is of the form fj(x) - fj(x'), where x and
x' are chosen separately for each j so that:

(1) x and x' are within the sample range of Xj, and

(2) the absolute difference |fj(x) - fj(x')| is maximized, and

(3) if Xj is measured on a numerical scale, x' < x.

If fj(x) is linear, these constraints imply that x' = min(x1j,
..., xnj) and x = max(x1j, ..., xnj). If fj(x) is quadratic, the
system determines the extreme point of fj as x" = -Bj1 / (2 Bj2),
where Bj1 and Bj2 denote the linear and quadratic coefficients of fj.
If x" < min(x1j, ..., xnj) or if x" > max(x1j, ..., xnj), then x
and x' are chosen as for the linear case. Otherwise x" replaces
either min(x1j, ..., xnj) or max(x1j, ..., xnj), so that
condition (2) above is satisfied. If fj(x) is a cubic or higher
degree polynomial, then the system evaluates fj(x) at min(x1j,
..., xnj) and max(x1j, ..., xnj) and at nine equally spaced
points in between, and then chooses x and x' to maximize
|fj(x) - fj(x')| from among these eleven points, resulting
sometimes in an approximate maximization of |fj(x) - fj(x')|.

If  Xj  is  a  categorically  scaled  variable,  x  and  x'  are  the  two 
categories  having  the  extreme  values  of  fj. 
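
The selection rule for numerically scaled predictors can be sketched as follows. This is a minimal illustration, not the Mulreg implementation: the function name `choose_contrast` and its arguments are hypothetical, the quadratic case assumes a nonzero quadratic coefficient, and the sample is assumed to contain at least two distinct values.

```python
def choose_contrast(f, xs, degree, coeffs=None):
    """Return (x, x') with x' < x maximizing |f(x) - f(x')|.

    f      : the adjusted-fit function of one numeric predictor
    xs     : sample values of that predictor
    degree : polynomial degree of f
    coeffs : (linear, quadratic) coefficients, needed when degree == 2
    """
    lo, hi = min(xs), max(xs)
    if degree == 1:
        candidates = [lo, hi]
    elif degree == 2:
        b1, b2 = coeffs
        vertex = -b1 / (2 * b2)          # extreme point of the quadratic
        candidates = [lo, hi]
        if lo < vertex < hi:             # only use it inside the sample range
            candidates.append(vertex)
    else:
        # endpoints plus nine equally spaced interior points (eleven in all)
        candidates = [lo + (hi - lo) * i / 10 for i in range(11)]
    pairs = [(a, c) for a in candidates for c in candidates if c < a]
    return max(pairs, key=lambda p: abs(f(p[0]) - f(p[1])))
```

For a quadratic whose extreme point lies inside the sample range, searching over the three candidates is equivalent to letting the extreme point replace whichever endpoint gives the smaller contrast.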

The  interaction  graph  described  in  Section  4.2  and  shown 
in  Figure  8  repeats  the  comparisons  of  the  effects  graph  for 
two  variables  which  share  one  or  more  interaction  terms  in  the 
model.  The  choice  of  points  for  the  second  variable  at  which 
the  contrasts  for  the  first  variable  are  repeated  is  made  as 
follows: If the second variable is categorically scaled, the
contrast  is  repeated  at  every  level  of  the  variable.  If  the  second 
variable  is  continuous  but  enters  the  model  only  linearly,  the 
contrast  is  repeated  only  at  its  minimum  and  maximum  values. 
If  the  model  contains  higher  powers  of  the  variable,  the 
contrast  is  repeated  at  the  minimum,  maximum,  and  midrange 
of  its  values  in  the  sample. 

Simultaneous  confidence  intervals.  The  confidence 
intervals  within  a  single  effects  graph  or  interaction  graph  are 
not  joint  confidence  intervals.  The  stated  degree  of  confidence 
pertains to each interval separately. However, if a particular
adjusted-fit  function  has  more  than  one  degree  of  freedom  for 
contrasting  x-values,  as  will  happen  if  the  function  of  a 
continuous  variable  is  quadratic  or  higher  order,  or  if  a 
categorically  scaled  predictor  has  three  or  more  categories, 
then  the  Mulreg  program  adjusts  the  confidence  interval  to 
account  for  the  post-hoc  manner  in  which  x  and  x’  are 
selected. 

If fj(x) is a polynomial of degree p, then the Scheffe
technique of replacing the t(df, 1-α/2) percentile in equation
(4) by the percentile $\sqrt{p\,F(p, df, 1-\alpha)}$ is used. In the case
of a categorically scaled predictor, a Bonferroni adjustment is
made: the t(df, 1-α/2) percentile is replaced by the percentile
t(df, 1-α/(m(m-1))), where m is the number of categories being
compared.
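
As a consistency check on these two adjustments, the special cases below may be helpful. They rely only on the standard facts that the square of a t(df) variable follows an F(1, df) distribution and that m categories give m(m-1)/2 pairwise comparisons; the notation follows the paragraph above.

```latex
% p = 1: the Scheffe multiplier reduces to the ordinary t percentile
\sqrt{1 \cdot F(1, df, 1-\alpha)} = t(df, 1-\alpha/2)

% m = 2: a single comparison, so the Bonferroni percentile is unchanged
t\!\left(df,\; 1 - \frac{\alpha}{2 \cdot 1}\right) = t(df, 1-\alpha/2)

% m = 3: three pairwise comparisons, each two-sided at level \alpha/3
t\!\left(df,\; 1 - \frac{\alpha}{3 \cdot 2}\right) = t(df, 1-\alpha/6)
```

In both families the adjusted interval therefore coincides with the unadjusted one when there is only one degree of freedom for the contrast.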

Confidence  intervals  in  the  interaction  graph  described  in 
Section  4.2  are  computed  using  the  same  tabled  critical  values 
as  the  effects  graph  of  the  same  contrast.  In  Figures  7  and  8, 
the  intervals  displaying  contrasts  with  respect  to  K  use  the 
Scheffe  method  with  2  degrees  of  freedom,  while  the  other 
intervals use the percentile t(df, 1-α/2), since the fitted
function  is  linear  in  Pn  and  Ca. 

Choice  of  error  term.  Mulreg  usually  uses  the  mean 
squared error of the residuals (mse) in the computation of the
confidence  intervals  for  the  effects  graph  and  the  interactions 
graph,  as  discussed  in  Section  3.  There  are  three 
circumstances  in  which  another  quantity  is  substituted  for  the 
residual  mean  squared  error  in  the  formulas. 

First,  if  the  data  for  the  multiple  regression  contains 
replications,  the  system  computes  a  mean  square  for  "pure 
error" based on the response variation within groups of points
having the same set of x-values. The confidence intervals in
the  Mulreg  effects  graphs  and  interactions  graphs  use  the  pure 
error  mean  square  instead  of  the  residual  mean  square 
whenever  there  are  at  least  four  degrees  of  freedom  for  pure 
error  and  the  usual  F-test  for  lack  of  fit  is  significant  at  the 
10%  level. 
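
The pure-error computation can be sketched as below. This is an illustrative sketch rather than Mulreg code: the function names are hypothetical, and the percentile lookup needed to judge the lack-of-fit F statistic at the 10% level is left to a statistical table or library.

```python
from collections import defaultdict

def pure_error(xs, ys):
    """Pure-error sum of squares and degrees of freedom.

    xs : sequence of predictor tuples, one per case
    ys : sequence of responses
    Cases with identical predictor tuples form one replicate group.
    """
    groups = defaultdict(list)
    for x, y in zip(xs, ys):
        groups[tuple(x)].append(y)
    ss, df = 0.0, 0
    for vals in groups.values():
        mean = sum(vals) / len(vals)
        ss += sum((v - mean) ** 2 for v in vals)
        df += len(vals) - 1
    return ss, df

def lack_of_fit_f(ss_resid, df_resid, ss_pe, df_pe):
    """F statistic: lack-of-fit mean square over pure-error mean square."""
    ss_lof = ss_resid - ss_pe
    df_lof = df_resid - df_pe
    return (ss_lof / df_lof) / (ss_pe / df_pe)
```

Under the rule described above, the pure-error mean square `ss_pe / df_pe` would replace the residual mean square only when `df_pe` is at least 4 and the F statistic exceeds its 90th percentile.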

Second,  if  a  model  contains  an  interaction  between  a  fixed 
effect  and  a  random  effect,  then  the  confidence  interval  for  the 
contrast  of  levels  of  the  fixed  effect  will  use  the  interaction 
mean  square  rather  than  the  mean  square  residual. 

Third,  if  a  robust  bisquare  regression  is  being  used  rather 
than  a  least  squares  fitting  algorithm,  the  robustly  estimated 
coefficients  and  a  robust  version  of  the  mean  square  error  are 
substituted  into  the  formulas. 

Transformations.  If  the  response  variable  has  been 
transformed,  the  adjusted-Y,  the  effects  graph,  and  the 
interaction  graph  are  all  computed  and  displayed  on  the 
transformed  metric.  In  order  to  make  these  graphs  more 
interpretable  in  such  cases,  the  program  can  use  a  "matched" 
scaling  of  the  response  transformation,  as  recommended  by 
Hoaglin et al. (1983, section 4E).

5.2  Proof  of  Equivalence  to  Partial  Residuals 

Suppose that X1 enters the model additively, so that the
fitted least squares model is of the form

$$y_i = F_1(x_{i1}) + F^*(x_{i2}, \ldots, x_{iJ}) + e_i, \quad i = 1, \ldots, n.$$

Then, using (1) and (2), the adjusted-Y values with respect to
the first predictor variable are

$$y_i^{adj} = F_1(x_{i1}) + \bar{F}^* + e_i,$$

while the corresponding partial residuals (which are
augmented partial residuals if F1 is not linear) are

$$pr_i = \bar{y} + (F_1(x_{i1}) - \bar{F}_1) + e_i,$$

where a bar denotes an average over the n cases.
Comparing these formulas, we note that they are equal if

$$\bar{y} = \bar{F}_1 + \bar{F}^*,$$

which  is  the  requirement  that  the  average  of  the  fitted  values 
from  the  regression  equals  the  average  value  of  the  response. 
As  is  well  known,  this  will  be  true  whenever  a  least  squares 
regression  model  contains  a  constant  term,  or  whenever  some 
linear combination of the predictor terms is constant for all n
cases. If the model cannot be reparametrized to contain a
constant term, then, depending on how the partial residuals are
defined,  they  may  differ  by  a  constant  amount  from  the 
adjusted-Y  values. 
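
The "well known" step can be spelled out briefly. The lines below are a sketch in the notation of this section, with bars denoting averages over the n cases and assuming the model contains a constant term.

```latex
% Normal equation for the constant term: the residuals sum to zero
\sum_{i=1}^{n} e_i = 0
% Averaging y_i = F_1(x_{i1}) + F^*(x_{i2}, \ldots, x_{iJ}) + e_i over i gives
\bar{y} = \bar{F}_1 + \bar{F}^* + \bar{e} = \bar{F}_1 + \bar{F}^*
% so the adjusted-Y value and the partial residual differ by
\left(F_1(x_{i1}) + \bar{F}^* + e_i\right) - \left(\bar{y} + F_1(x_{i1}) - \bar{F}_1 + e_i\right)
  = \bar{F}_1 + \bar{F}^* - \bar{y} = 0
```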

ABSTRACT

Stochastic  optimization  procedures 
have  been  shown  to  be  efficient  methods 
for  finding  global  extrema  of  objective 
functions.  In  this  article  we  report 
computational  results  obtained  using  the 
generalized  simulated  annealing  method 
on  a  set  of  standard  global  optimization 
test  problems.  The  results  are  compared 
to  those  obtained  using  a  self- 
regulating  mechanism  which  chooses  a 
random  step  distribution  based  on  the 
local  topography  and  the  currently 
specified  annealing  temperature. 

INTRODUCTION 

The problem of finding the global
extremum (assumed to be a minimum here)
of a real-valued function has been an
important one for a long time. There
has been a recent increase in interest
in solving global optimization problems
using stochastic methods which, though
computationally intensive, are efficient
because of the increased speed of
computation now available. These methods
combine some form of sampling (usually
random) and local search procedures.

The better-known stochastic optimization
methods have some very attractive
behavioral properties and have proved to be
efficient search procedures over a wide
range of objective function topographies,
including problems with high
dimensionality and multiple extrema.

A stochastic method based on the
simulation of the cooling of a liquid
substance was shown to be useful for
function optimization by Kirkpatrick,
Gelatt, and Vecchi (KGV) (1983). Called
"simulated annealing," the method has
proven to be very useful for solving
large combinatorial problems as documented
by KGV and others, including NP-hard
problems such as the traveling salesman
problem studied by Bonomi and Lutton
(1984). The method also has attractive
theoretical properties for discrete
spaces; Lundy and Mees (1986), Hajek
(1986), and Geman and Geman (1984) all
prove convergence of the algorithm under
various assumptions on classes of NP-hard
problems. The extensive bibliography
compiled by Golden (to appear in
1988) contains numerous references to
applications to combinatorial problems.

Simulated annealing applied to functions
of continuous variables behaves
much like a random walk with a bias, but
lacks reasonable convergence behavior in
many applications. Certain modifications
can be made to hasten the method's
convergence, such as the stepwise parameter
adjustment of Vanderbilt and Louie
(1984), who report solution times for
their method. The study reported here
investigates the behavior of the generalized
simulated annealing (GSA) method
introduced by Bohachevsky, Johnson, and
Stein (1986), which uses the current
value of the function to control the
random process; various aspects of the
method's behavior over a range of test
problems with continuous variables are
shown. Section 2 presents the simulated
annealing algorithm and its generalization,
and Vanderbilt and Louie's self-regulating
simulated annealing (SRSA)
algorithm. Section 3 gives the results
of its application to the test problems,
and Section 4 gives a summary with
comments.

APPROACHES  TO  SIMULATED  ANNEALING 

"Annealing"  refers  to  the  process  in 
which  a  substance  is  first  melted,  then 
the  temperature  is  lowered  slowly.  The 
substance  is  allowed  to  spend  a  lot  of 
time  at  temperatures  near  the  freezing 
point  of  the  substance,  thereby  allowing 
the  atoms  in  the  substance  to  arrange 
themselves  into  configurations  with  the 
lowest  potential  energy.  The  desire  is 
to  achieve  a  "ground  state"  (lowest 
potential  energy)  arrangement  of  atoms 
at  each  temperature.  This  ground  state 
configuration  occurs  when  the  potential 
energy (function) is at its global minimum
for all possible arrangements of
atoms  at  that  temperature.  KGV  give  an 
interesting  history  of  how  Metropolis  et 
al.  (1953)  developed  an  algorithm  to 
simulate  this  annealing  process  for  any 
particular  substance.  Starting  with  a 
substance  with  an  arrangement  of  atoms 
at  potential  energy  E,  the  Metropolis 
algorithm  simulates 

a  new  arrangement  of  atoms 
resulting  in  a  change  in 
energy, denoted ΔE;

If ΔE is negative, accept
the arrangement by letting
that be the new arrangement
of atoms for the substance;

If ΔE is positive, accept
the arrangement with
probability exp(-ΔE/(K_B T)),
where T is the temperature
of the substance and K_B is
the Boltzmann constant.
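
The acceptance rule above can be sketched in Python as follows. The function name, the default of 1.0 for the Boltzmann constant (a convenient unit choice for optimization), and the treatment of a zero energy change as an acceptance are assumptions made for illustration.

```python
import math
import random

def metropolis_accept(delta_e, temperature, k_b=1.0, rng=random):
    """Decide whether to accept a proposed arrangement.

    delta_e     : change in energy caused by the proposed arrangement
    temperature : current temperature T (must be positive)
    k_b         : Boltzmann constant (1.0 here, an assumed unit choice)
    """
    if delta_e <= 0:
        return True                      # downhill moves are always accepted
    # uphill moves are accepted with probability exp(-delta_e / (k_b * T))
    return rng.random() < math.exp(-delta_e / (k_b * temperature))
```

At a high temperature almost every uphill move passes the test, while at a low temperature almost none do, which is the source of the bias toward lower energy.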

The simulation "moves" from configuration
to configuration, following a
random walk with a bias to lower energy
values, since the probability of acceptance
of lower-energy arrangements is
greater. The simulation assumes the
system evolves into a Boltzmann
distribution.

The analogy to more general applications
is clear: the energy function is
any objective function, the arrangement
of atoms is the combination of independent
variable values, and the rearrangement
of atoms is equivalent to the
iterative improvement of function values
by changing variable values. The usefulness
of simulated annealing as a function
optimization procedure is that it
can move to detrimental function values
in its optimization search, which prevents
it from being trapped in local
minima. In addition, the implementation
of the search for the global minimum
does not require any derivatives, only
function evaluations, making it both
analytically and computationally
convenient.

This standard annealing method is
handicapped in function optimization,
however, because there is no "cooling"
(referred to as an annealing schedule);
that is, the temperature of the substance
remains fixed and, therefore,
excessive numbers of moves are made in
searching for minimum-energy configurations.
The generalized simulated
annealing method provides a gradual
(though not necessarily monotonic) decline
of temperature values, thereby reducing
the probability of acceptance of a
higher-energy, and detrimental, point as
the function values approach the (estimated
or known) global minimum of the
function. This is achieved by automatically
setting the acceptance probability
according to the function topography.

The change in position is governed by a
specified acceptance probability which
depends on the parameters of an acceptance
probability function. This function
decreases the probability of moving
to a new location as the algorithm
progresses.

Simulated annealing has an exponential
acceptance probability function so
that the probability of moving from the
location at the i-th function evaluation
to the new location corresponding to the
(i+1)-th evaluation is

p_i = exp(f_i^0 * (f_i - f_{i+1}) * K)

This was generalized to

p_i = exp(f_i^g * (f_i - f_{i+1}) * K)

Although any g ≤ 0 can be used, this
investigation considers only g = -1.
Standard simulated annealing is
recovered from this generalization by
setting g = 0 and using a predetermined
set of values for K and a predetermined
number of function evaluations at each
K. Vanderbilt and Louie set g = 0 and
use an indexed set of coefficients for K
to force p_i to approach 0. In addition,
they suggested a method for self-regulating
the determination of the step
size and the step distribution.

The GSA algorithm can be summarized
using the following notation. Let F(x)
be the real-valued function of interest
evaluated at point x, an element of some
bounded subset of R^n. Let Z be the
global minimum value of F and let x0 be
the initial set of independent variable
values. The algorithm proceeds by:

1. Selecting x0 (randomly or
based on other available
information) and computing
F0 = F(x0).

2. If this value is close
enough to Z, stop; otherwise

3. Choose a direction from the
uniform distribution on the
unit hypersphere centered at
x0. Generate the unit direction
vector U:

U_i = Y_i / (Y_1^2 + Y_2^2 + ... + Y_n^2)^(1/2),
i = 1, ..., n,

where Y_i is a standard
normal deviate.

4. Choose a step size Δr and
determine a new set of
variable values

x = x0 + Δr*U

5. If x is not in the bounded
support of F, generate a new
x; otherwise,

6. If F(x) ≤ F0, accept x by
setting x0 = x and F0 = F(x).

7. If F(x) > F0, accept x with
probability

p = exp(-Beta*(F(x) - F0)/F0)

where Beta is a preset
parameter. Otherwise,
generate a new step.

8. Continue this random walk
until |F0 - Z| < ε, some
arbitrary specified
precision.
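
The eight steps can be sketched in Python as follows. This is a minimal sketch, not the authors' program: the function names, the box-shaped bounds, the fixed step size, the iteration cap `max_iters`, and the seed are all assumptions, and the acceptance ratio assumes that F is positive on its support so that division by F0 is well defined.

```python
import math
import random

def unit_direction(n, rng):
    """Step 3: random unit vector built from n standard normal deviates."""
    y = [rng.gauss(0.0, 1.0) for _ in range(n)]
    norm = math.sqrt(sum(v * v for v in y))
    return [v / norm for v in y]

def gsa(F, x_start, bounds, beta, step, z, eps=1e-3, max_iters=20000, seed=0):
    """Generalized simulated annealing with g = -1.

    bounds : list of (low, high) pairs, an assumed box-shaped support
    z      : known or estimated global minimum value of F
    Returns the final point and its function value.
    """
    rng = random.Random(seed)
    x0 = list(x_start)
    f0 = F(x0)                                      # step 1
    for _ in range(max_iters):                      # safety cap (assumption)
        if abs(f0 - z) < eps:                       # steps 2 and 8
            break
        u = unit_direction(len(x0), rng)            # step 3
        x = [a + step * b for a, b in zip(x0, u)]   # step 4
        if any(v < lo or v > hi for v, (lo, hi) in zip(x, bounds)):
            continue                                # step 5: outside support
        fx = F(x)
        # step 6: accept improvements; step 7: accept detrimental points
        # with probability exp(-beta * (F(x) - F0) / F0)
        if fx <= f0 or rng.random() < math.exp(-beta * (fx - f0) / f0):
            x0, f0 = x, fx
    return x0, f0
```

A call such as `gsa(lambda x: 1.0 + x[0]**2 + x[1]**2, [3.0, 3.0], [(-5.0, 5.0)] * 2, beta=50.0, step=0.05, z=1.0)` exercises the sketch on a simple bowl-shaped function whose minimum value is 1.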

To use the algorithm to search for
the optimum of a function, it remains to
set the parameters Δr [the step size]
and Beta [analogous to 1/(temperature of
the system)]. This is typically done by
trial and error. A large Beta causes less
movement than a small Beta. The
practical considerations are to

1. select Beta so that the
probability of accepting
detrimental points is not
too small (the algorithm
cannot escape local extrema) or
too large (totally random
walk);

2. select Δr so that the
probability of exiting a
local extremum is not too
small (in which case the
algorithm gets stuck too
easily) or too large (leaves
all extrema, including the
global), given Beta.

In  practice,  the  algorithm  appears  to 
perform  best  when  about  60%  (and  between 
50%  and  90%)  of  the  detrimental  moves 
are  accepted.  The  performance  of  the 
algorithm  for  a  selected  step  size  is 
influenced  most  by  (1)  the  variability 
in  the  topography  (values  of  F)  over  the 
support of F, (2) the range of the support,
and (3) the number of dimensions.
The next section illustrates this problem
more specifically.

In addition to setting the parameters
which govern the acceptance probability,
a stopping rule must be specified. The
most straightforward method, comparable
to that used with several of the other
stochastic optimization methods, is to
terminate the algorithm after a specified
number of iterations without a
move. The results reported in the next
section are based on using 50 iterations
without a move as a stopping rule; major
shifts in this number, however, did not
appear to have a large impact on the
results.
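
The stopping rule can be expressed as a small driver loop. The sketch below is hypothetical: `propose_and_accept` stands for any callable that performs one iteration of the search and reports whether a move was made, and the cap `max_iters` is an added safeguard rather than part of the rule described above.

```python
def run_until_stalled(propose_and_accept, max_stall=50, max_iters=1_000_000):
    """Iterate until max_stall consecutive iterations pass without a move.

    propose_and_accept : callable returning True if the iteration moved
    Returns the total number of iterations performed.
    """
    stall = 0
    iters = 0
    while stall < max_stall and iters < max_iters:
        moved = propose_and_accept()
        stall = 0 if moved else stall + 1   # any move resets the counter
        iters += 1
    return iters
```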

COMPUTATIONAL  RESULTS 

Computational  results  applying  a 
collection  of  stochastic  optimization 
methods  to  a  set  of  seven  test  functions 
were  first  collected  in  Dixon  and  Szego 
(1978),  who  proposed  the  standard  test 
functions. Two other stochastic optimization
methods were tested on the same
set of functions by Rinnooy Kan and
Timmer (1984, 1987); the specific coefficients
for the test functions are
given in Dixon and Szego. In this section,
summary measures of the performance
of the GSA algorithm are given for
the same set of functions.

First a brief discussion of each
problem is presented below, followed by a
summary list of the problems, and finally
the solution results are presented.

The Goldstein-Price (GP) function is
a two-dimensional function with three
local minima and one global minimum. An
inverted view of the function is given in
Figure 1. It shows the smoothness of
the function along with the minima.
Figure 2 shows the mean number of
evaluations for various values of the
parameter Beta and the step size. For any
specified  pair  of  parameter  values,  the 
variability  in  number  of  evaluations  to 
termination  is  due  to  differences  in  the 
search  path  taken  because  of  different 
random  number  seeds.  This  variability 
can  be  substantial  in  terms  of  number  of 
evaluations,  but  for  small  problems  such 
as  this  one  the  differences  in  CPU  time 
are  negligible. 
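
For reference, the Goldstein-Price function can be written out directly. The form below is the one commonly given in the global optimization literature rather than one quoted from this paper, so the coefficients should be checked against Dixon and Szego; its global minimum value is 3, attained at (0, -1).

```python
def goldstein_price(x, y):
    """Goldstein-Price test function in its commonly published form."""
    a = 1 + (x + y + 1) ** 2 * (
        19 - 14 * x + 3 * x ** 2 - 14 * y + 6 * x * y + 3 * y ** 2)
    b = 30 + (2 * x - 3 * y) ** 2 * (
        18 - 32 * x + 12 * x ** 2 + 48 * y - 36 * x * y + 27 * y ** 2)
    return a * b
```

Because the minimum value is positive, the ratio (F(x) - F0)/F0 used by the GSA acceptance probability is well defined everywhere on this function.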

The Branin (BR) function is a
two-dimensional function with three minima,
all global. It is shown with an
illustrative search path in Figure 3. Figure
4 shows the sensitivity of the mean
number of evaluations to the two
parameters for several representative values.
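
The Branin function can likewise be sketched in its commonly published form (again an assumption to be checked against Dixon and Szego). In that form the three global minima lie at (-π, 12.275), (π, 2.275), and (3π, 2.475), each with the value 10/(8π), roughly 0.398.

```python
import math

def branin(x1, x2):
    """Branin test function in its commonly published form."""
    a = x2 - 5.1 / (4 * math.pi ** 2) * x1 ** 2 + 5 / math.pi * x1 - 6
    return a * a + 10 * (1 - 1 / (8 * math.pi)) * math.cos(x1) + 10

# The three global minimizers of the commonly published form
BRANIN_MINIMA = [(-math.pi, 12.275), (math.pi, 2.275), (3 * math.pi, 2.475)]
```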

One  function  (H3)  of  the  Hartman 
family  is  a  three-dimensional  function 
with  five  minima:  four  local  minima  and 
one  global.  The  actual  global  is  not 
the  one  reported  in  Dixon  and  Szego 
(1978);  this  function  was  difficult  for 
GSA  because  the  function  is  virtually 
flat  in  one  dimension  at  the  global,  so 
the  independent  variable  is  unstable  in 
that  dimension  prior  to  termination  of 
the  algorithm.  The  global  found  in  this 
test  was  (approximately);  (.11,  .555, 

Another  function  (H6)  of  the  Hartman 
family  is  the  six-dimensional  version  of 
H3.  It  was  more  stable  at  the  global 
and,  surprisingly,  easier  for  GSA  to 
terminate  than  the  three-dimensional 
function.  This  held  true  for  a  wide 
range  of  parameter  values  and  a  large 
number  of  random  seeds,  although  this 
performance  does  not  seem  to  hold  true 
for  the  other  methods  tested.  Figure  5 
summarizes the mean number of evaluations
required on these test functions
for  some  representative  parameter 
values. 

Three functions from the Shekel
family - (S5), (S7), (S10) - were also
tested. This series of functions in
four dimensions has 5, 7, and 10 minima,
respectively, each including one global
minimum. This function family is the
most difficult for the GSA method. A
two-dimensional version shown in Figure
6 illustrates the reason: the depths of
the local minima are great relative to
the region of attraction at their
mouths. The remainder of the surface is
largely flat so that large step sizes
tend to step over the regions of
attraction and small step sizes fall in the
local minimum they first encounter and
are never able to escape.

The  GSA  algorithm  was  started  from  a 
number  of  boundary  positions  and  one 
internal  position  and  Figure  7  shows 
that  the  proportion  of  search  paths 
terminating  at  each  of  the  minima  is 
proportional to the depth of the minimum,
so that the largest proportion of
searches  terminates  at  the  global.  This 
is  because  the  area  of  attraction  for  a 
minimum  is  proportional  to  its  depth. 




GENERAL  SOLUTION  METHOD 

The  precision  of  the  solutions 
depends  on  the  step  size.  The  most 
efficient  method  for  determining  the 
global minimum of a function with appropriate
precision was to first conduct a
global search with a larger step size
proportional to the volume of the bounded
support for F in R^n. This phase
proceeded  by  starting  the  GSA  algorithm 
from  several  remote  boundary  positions 
and  running  100  independent  random 
search  paths  from  each  starting  location 
(with reasonable parameter values determined
by pre-sampling the function) to
give  some  indication  of  variability  in 
solution  times  and  paths.  The  global 
phase  located  all  the  minima  in  all  the 
test  functions  (except  the  shallowest 
minimum in the 10-minimum Shekel function).
GSA always terminated at a minimum.
Then the step size was adjusted
for  precision  and  a  local  search  was 
conducted  in  the  region  of  each  of  the 
minima  found  in  the  global  search.  To 
determine  which  local  minimum  is  the 
global,  a  local  search  should  be  done  in 
each  region  identified  by  the  global 
searches.  The  GSA  algorithm  was  run  100 
times  in  each  local  region  and  found  the 
value  of  the  local  minimum  for  that 
region  for  all  runs  on  all  functions. 
Using  this  general  approach  the  GSA 
method  found  the  global  minimum  to  every 
test function to any arbitrary precision;
the algorithm did not terminate at
the global on every run for some functions,
but multiple runs resulted in the
highest  proportion  locating  the  global 
minimum  for  all  the  functions.  Local 
searches  always  discriminated  between 
local  and  global  minima,  and  terminated 
in  the  local  regions  in  which  they 
began.  This  method  also  showed  the 
approximate  minimum  previously  given  for 
H3  to  be  incorrect. 


COMPUTATIONAL  RESULTS 

Table  1  gives  a  summary  of  the  test 
functions  described  in  this  section  and 
the  parameter  settings  used  to  reach 
solutions.  Table  2  lists  the  other 
global  optimization  methods  used  on 
these  test  problems. 

The  proportion  of  global  searches 
terminating  at  the  global  is  listed  in 
the summary chart of computational results
presented in Table 3. All other
searches  terminated  at  a  local  minimum. 

Table  3  gives  the  number  of  function 
evaluations to termination for the various
methods used on the test functions.
Table  4  gives  the  same  results  in  terms 
of  standard  time  units  where  one  unit  is 
the  CPU  time  to  do  1000  evaluations  of 
S5  at  a  specified  location. 


The results reported for GSA are the
average time over 100 trials at the
parameter values given in Table 2, starting
from some remote boundary point in
the support of F. The test results for
the other methods are the averages of 4
independent runs and no variability
measures or parameter settings are
available.

What  Tables  3  and  4  do  not  show  is 
the  sampling  necessary  to  determine 
reasonable  values  for  the  parameters. 
This  is  not  extensive,  given  initial 
settings  related  to  the  function  proper¬ 
ties,  but  is  a  component  of  the  solution 
process  (as  it  is  for  some  of  the  other 
methods).

SUMMARY  AND  CONCLUSIONS 

Computational  experience  with  the 
generalized  simulated  annealing  method 
for  small  problems  over  continuous  var¬ 
iables  indicates  that  the  use  of  GSA  may 
have  some  promise  for  problems  of  this 
type.  The  results  are  compared  to  a  set 
of  stochastic  global  optimization  meth¬ 
ods  which  represent  some  of  the  best 
alternatives  available.  GSA  appears  to 
be  competitive  in  terms  of  solution 
times  as  well  as  reliability  on  this 
class  of  problems.  The  number  of  eval¬ 
uations  should  be  interpreted  correctly; 
the  number  of  local  evaluations  must  be 
done  for  each  of  the  minima  located  by 
the  global  search.  The  disadvantage  of 
this  procedure  is  that  it  requires  a 
large  amount  of  user  interaction  re¬ 
starting  the  procedure  at  different 
points.  The  advantage  of  the  procedure 
is  that  for  small  problems  like  these, 
this  method  provides  a  microcomputer- 
based  ability  to  solve  problems  that  we 
were  not  able  to  solve  using  Eureka,  a 
commercially  available  micro-based 
steepest  descent  non-linear  optimization 
package. 

There  are  several  potential  modifica¬ 
tions  that  could  make  the  algorithm  more 
efficient.  Two  current  concerns  are  its 
sensitivity  to  the  value  of  the  paramet¬ 
ers  which  govern  the  probability  of 
acceptance  (and  to  random  seeds),  and 
its  lack  of  an  operable  stopping  rule. 

Results of using the algorithm on
large-dimension continuous-variable
problems (50 or more) have not been
reported.  Although  it  has  proven  suc¬ 
cessful  on  very  large  combinatorial 
problems,  the  behavior  of  the  algorithm 
on  these  small  problems  with  continuous 
variables  indicates  that  there  appear  to 
be  some  serious  potential  problems  for 
the  algorithm  to  be  a  useful  "general 
purpose"  tool  for  solving  very  large 
problems  (over  100  dimensions,  say)  with 
continuous variables. The method is
sensitive to the parameter values determining
the acceptance probability; the
variability with respect to the random
seed is significant; for functions that
are not very "smooth" it appears to be
slow; and the two-phase procedure of
global and local optimization requires
substantial interaction on the part of
the user.
Table 2: Optimization Methods Used on Test Problems

Method                                   Description

Trajectory (T)                           Gradient path method (Branin and Hoo (1972))
Random direction (RD)                    Random directions (Bremmerman (1970))
Controlled random search (CR)            Price (1978)
Density clustering (DC)                  Sample concentration and clustering (Törn (1976))
Density reduction (DR)                   Density clustering, reduction and spline
                                         fitting (De Biase and Frontini (1978))
Multi-level single linkage (ML)          Clustering by distance (Rinnooy Kan
                                         and Timmer (1987))
Self-regulating (SRSA)                   Annealing with self-adjusting step determination
                                         (Vanderbilt and Louie (1984))
Generalized simulated annealing (GSA)
Table 3: Number of Function Evaluations to
Find Global Minimum

Method 


Function 
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DC 
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ML 

SRSA  f 

P 
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* 

/ 

P 

GP 

1 

300 

2500 

2499 

378 

294 

1186 

/ 

99 

170  + 
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/ 

100 

BR 
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219 

557 

/ 

100 

121  + 

115 

/ 

100 

K3 

• 

2400 

2584 
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370 

1224 

/ 

100 

310  + 

145  / 

78 

H6 

515 

7600 

3447 

807 

877 

1914 

/ 

62 

287  + 

235 

/ 

100 

S5 

1 

1  5500 

* 

3800 

3649 

620 

347 

3910 

/ 

54 

400  + 

296 

/ 

58 

S7 

1  5020 

* 

4900 

3606 

788 

399 

3421 
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64 

261  + 

296 
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47 

SIO 

1  4860 

4400 

3874 

1160 

447 


3078 
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/ 

47 

/P: Proportion of trials ending at the global; remainder at locals
No results available
*: Failed to find global

**: Number of global evaluations plus local evaluations
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3 

4 
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2 
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4 

4 

14 
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1 
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• 

8 

8 

16 

1 

4 
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H6 

3 

46 

16 

21 

3 

12 

2.0 

S5 

]  9 

• 

14 

10 

23 

.75 

16 

2.1 

S7 

1  8.5 

• 

20 

13 

20 

1 

15 

1.7 
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1  9.5 

1 

* 

20 

15 

30 

1.5 

15 

1.9 

Standard time unit: CPU time for 1000 function evaluations of S5.

No results available
Failed to find global

**: Number of global evaluations plus local evaluations


Figure 1. GP Function (inverted)

Figure 2. Goldstein-Price (axis label: Function Evaluations)
SIMULATED ANNEALING IN THE CONSTRUCTION OF EXACT OPTIMAL DESIGNS

Introduction

Exact  optimal  design  of  experiments  is 
concerned  with  specifying  n  points  from  a 
design  space  at  which  observations  are  to  be 
taken  in  order  to  achieve  precise  estimation. 
A linear model of the form

$Y = X\beta + \varepsilon$

is assumed, where $Y$ is an $n \times 1$ vector of
observations, $X$ is the $n \times p$ design matrix, $\beta$ is
a $p \times 1$ vector of unknown regression parameters,
and $\varepsilon$ is an $n \times 1$ vector of uncorrelated
experimental errors with mean zero and
constant variance $\sigma^2$. The $i$th observation $y_i$
is obtained at a vector-valued point $x_i$ in a
$q$-dimensional compact design space $\chi$, and
the corresponding row of $X$ is written $f'(x_i)$.
For example, consider a second order response
surface model with two factors,

$f'(x_i) = (1, x_{i1}, x_{i2}, x_{i1}^2, x_{i2}^2, x_{i1}x_{i2})$.

If the parameters are estimated by least
squares, the variance of the estimate of $\beta$ is
given by $\sigma^2 (X'X)^{-1}$. The variance of the
fitted values at $x_i$ is proportional to

$d(x_i) = f'(x_i)(X'X)^{-1}f(x_i)$,

termed the variance function.

Designs are chosen using one or more
optimality criteria. Generally such criteria
are represented by functionals on the $p \times p$
covariance matrix $(X'X)^{-1}\sigma^2$ (see Steinberg
and Hunter, 1984, for a review). The most
widely applied criterion is D-optimality,
first proposed by Wald (1943). D-optimal
designs maximize $|X'X|$, in effect minimizing
the generalized variance of the estimated
coefficients. If the errors ($\varepsilon_i$) are normally
distributed, the design minimizes the volume
of a fixed level confidence ellipsoid for $\beta$.

If $X^*$ is the design matrix corresponding to
the D-optimal design, the D-efficiency of any
other n-point design is given by
$100\,(|X'X| / |X^{*\prime}X^{*}|)^{1/p}$. If the D-optimal design is
unknown, as is often the case, the relative
efficiency,

$\text{R-efficiency} = 100\,(|X_1'X_1| / |X_2'X_2|)^{1/p}$,

is typically used to compare n-point designs
having respective design matrices $X_1$ and $X_2$.
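As a small illustration of the relative efficiency formula, the following Python sketch compares two 9-point designs for the second order model with two factors. The function names (`model_row`, `information_matrix`, `det`, `relative_efficiency`) and the two example designs are assumptions made for this sketch; they do not come from the paper.

```python
from itertools import product

def model_row(x):
    """Row f'(x) of the second order model with two factors."""
    x1, x2 = x
    return [1.0, x1, x2, x1 * x1, x2 * x2, x1 * x2]

def information_matrix(points, row=model_row):
    """Return X'X for the design whose rows are row(x)."""
    rows = [row(x) for x in points]
    p = len(rows[0])
    return [[sum(r[a] * r[b] for r in rows) for b in range(p)] for a in range(p)]

def det(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [list(r) for r in matrix]
    n, d = len(a), 1.0
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(a[r][c]))
        if abs(a[piv][c]) < 1e-12:
            return 0.0
        if piv != c:
            a[c], a[piv] = a[piv], a[c]
            d = -d
        d *= a[c][c]
        for r in range(c + 1, n):
            f = a[r][c] / a[c][c]
            for k in range(c, n):
                a[r][k] -= f * a[c][k]
    return d

def relative_efficiency(design1, design2, row=model_row):
    """R-efficiency = 100 (|X1'X1| / |X2'X2|)^(1/p)."""
    p = len(row(design1[0]))
    d1 = det(information_matrix(design1, row))
    d2 = det(information_matrix(design2, row))
    return 100.0 * (d1 / d2) ** (1.0 / p)

# Hypothetical designs: the 3^2 factorial, and the same design with the
# centre point replaced by a second copy of the corner (1, 1).
factorial = list(product([-1, 0, 1], repeat=2))
modified = [pt for pt in factorial if pt != (0, 0)] + [(1, 1)]
print(relative_efficiency(modified, factorial))
```

Because the ratio is raised to the power 1/p, the efficiency is on a per-parameter scale, so designs for models of different size can be read on a comparable footing.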

Early efforts in D-optimal design
construction used mathematical programming
techniques to directly maximize $|X'X|$ (see,
e.g., Box, 1966). Box and Draper (1971) used
Powell's direct search to maximize $|X'X|$ in up
to 30-dimensional space. More recently
various exchange algorithms, for example,
Mitchell's DETMAX (1974), Fedorov (1972),
k-exchange (Johnson and Nachtsheim, 1983),
reduce the dimension of the search space.
These algorithms begin with a nonsingular
n-point design and iteratively add a point
from the design space and delete a point from
the current design such that a maximal
increase in $|X'X|$ is obtained. The exchange
of design points typically is determined by
computing optima of the variance function,
deleting the point with minimum variance of
prediction and adding the point with maximum
variance. Convergence of the sequence,
however, may be to a locally optimal design.

When $\chi$ is finite, various simplifications
result. For example, optimization of the
variance function can be globally obtained at
each iteration. Moreover, when there are N
design or "candidate" points in $\chi$, there are
$\binom{n+N-1}{n}$ possible designs (Welch, 1982),
making an exhaustive search theoretically
possible. Welch (1982) developed a branch-and-bound
algorithm which guarantees global
exact D-optimal designs, but is
computationally infeasible with large
dimensional problems.
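To give a feel for how quickly exhaustive search becomes impractical, the count of n-point designs can be evaluated directly. The sketch below reads the count attributed to Welch (1982) as the number of multisets of size n drawn from N candidates, that is, the binomial coefficient C(n+N-1, n); the function name is an assumption of this sketch.

```python
from math import comb

def number_of_designs(N, n):
    """Number of n-point designs from N candidate points, replication allowed."""
    return comb(N + n - 1, n)

# 3^2 = 9 candidate points and a 6-point design
print(number_of_designs(9, 6))
# 3^3 = 27 candidate points and a 10-point design
print(number_of_designs(27, 10))
```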

However, design spaces are often
represented by convex regions in $R^q$, and the
simplifications described above are not
applicable. Cook and Nachtsheim (1980) and
Johnson and Nachtsheim (1983) have advocated
the use of exchange algorithms with embedded
nonlinear optimization routines to determine
the points to exchange. Cook and Nachtsheim
(1980) used a combined grid-Powell search in
an attempt to locate the D-optimal design.
Meyer and Nachtsheim (1987) implemented GRG2,
a generalized reduced gradient method for
nonlinear optimization, within the k-exchange
algorithm.

One  inherent  difficulty  associated  with  the 
use  of  nonlinear  optimization  routines  is  the 
convergence  at  local  optima.  As  the  dimension 
of  the  problem  and  the  number  of  terms  in  the 
model  increase,  the  number  of  local  optima  of 
the  variance  function  increases.  In  an 

attempt  to  surmount  the  obstacles  encountered 
with  current  algorithms,  we  implement  the 
simulated  annealing  algorithm  to  directly 
maximize  the  determinant  of  X'X,  and  evaluate 
its  performance  on  both  finite  and  convex 
design  spaces. 

Haines  (1987)  applied  the  simulated 
annealing  algorithm  to  construct  various 
n-point  optimal  designs  using  several  criteria 
for  polynomial  regression  of  up  to  degree  5 
and  for  the  second  order  model  with  2  factors. 
Trial  designs  were  constructed  by  successively 
perturbing  individual  points.  The  algorithm 
was  most  effective  in  constructing  G-optimal 
designs  that  minimize  the  maximum  variance 
function. 

We modify the generalized simulated
annealing method described by Bohachevsky,
Johnson, and Stein (1986) to maximize $|X'X|$.
This algorithm, which has the "ability to
migrate through a sequence of local extrema in
search of the global solution and to
recognize when the global extremum has been
located" (Bohachevsky, et al. 1986, p. 209)
substantially improved the D-optimal 11-point
design for a specific nonlinear problem with
many constraints given by Bates (1983).
Generalized simulated annealing makes the
probability of accepting a detrimental step
tend to zero as the random walk approaches the
global optimum.

Application of the Generalized Simulated Annealing Algorithm

The models we consider are the first order
and second order response surface models.
The first order model

$E(y) = \beta_0 + \sum_j \beta_j x_j$, $j = 1,2,\ldots,q$

has $p = q + 1$ parameters. The second order model

$E(y) = \beta_0 + \sum_j \beta_j x_j + \sum_{j \le k} \beta_{jk} x_j x_k$, $j,k = 1,2,\ldots,q$

contains $p = (q+1)(q+2)/2$ parameters.

The construction of D-optimal designs
requires the maximization of $|X'X|$, the value
of which is quite large particularly for
problems with many factors or many design
points. The values of the parameters used in
applying the generalized simulated annealing
method are simpler to adjust if the objective
function is defined as the maximization of
$|X'X|^{1/p}$, where $p$ is the number of parameters
in the model. To ensure the desired behavior
of the probability near the global optimum,
the objective function is defined to converge
to zero at the global optimum. The
maximization of $|X'X|^{1/p}$ is thus substituted
by its equivalent, the minimization of
$\phi(X) = D_{max} - |X'X|^{1/p}$, where $D_{max}$ is the user's
prior estimate of the optimum determinant. If
the value of the maximum determinant is
understated and $\phi(X)$ becomes negative, $D_{max}$ is
increased and the search is continued. If the
estimate is too large, $D_{max}$ is decreased to
ensure the objective function converges to
zero.

Finite  Design  Spaces 

The D-optimal design for first order models
has been shown to consist entirely of the
vertices of the q-dimensional hypercube (Box
and Draper, 1971 and Mitchell, 1974). The
candidate set contains $2^q$ points; each
coordinate of $x_i$ is -1 or +1. For second
order models we assume each coordinate may be
-1, 0 or +1, defining $3^q$ points in the
candidate set.

The algorithm begins with an $n \times q$ starting
matrix consisting of the coordinates of n
points chosen randomly from the candidate set.
At each iteration, a trial design is defined
by perturbing $m < nq$ of the coordinates. If
the value of the objective function is
decreased, the trial design is accepted with
probability 1. If the value is increased, the
trial design is accepted with probability

$p = \exp\{-\beta\,\Delta\phi(X)/\phi(X_0)\}$

where $\beta$ is a nonnegative control parameter,
$\Delta\phi(X)$ is the change in the objective function
value, and $\phi(X_0)$ is the current value of the
objective function. The appropriate values
for $m$ and $\beta$ depend on particular problem
characteristics and are found by
experimentation. As the algorithm is
executed, the value of $m$ is gradually
decreased to a minimum of 1 as the global
minimum is approached, in which case a single
coordinate change is made to define a trial
design.


The steps of the algorithm with finite
design spaces are as follows:

1. Generate a random starting design, $X_0$.

2. Calculate $\phi(X_0)$. If $|\phi(X_0)| \le \epsilon$,
go to 7.

3. Determine a trial design, X, by randomly
selecting m coordinates to change.

a. For first order models:

If $x_{ij} = -1$, set $x_{ij} = +1$.

If $x_{ij} = +1$, set $x_{ij} = -1$.

b. For second order models:

If $x_{ij} = -1$, set $x_{ij} = 0$.

If $x_{ij} = +1$, set $x_{ij} = 0$.

If $x_{ij} = 0$, set $x_{ij} = +1$ with
probability .5.

Set $x_{ij} = -1$ with probability .5.

4. Calculate the new value of the objective
function $\phi(X)$ and let

$\Delta\phi(X) = \phi(X) - \phi(X_0)$.

If $|\phi(X)| \le \epsilon$, go to 7.

5. If $\phi(X) \le \phi(X_0)$, let $X_0 = X$ and
$\phi(X_0) = \phi(X)$. Go to 3.

6. If $\phi(X) > \phi(X_0)$,
let $p = \exp\{-\beta\,\Delta\phi(X)/\phi(X_0)\}$ and
generate a uniform [0,1] random
variable, u.

a. If $u \ge p$, go to 3.

b. If $u < p$, let $X_0 = X$ and
$\phi(X_0) = \phi(X)$. Go to 3.

7. Stop.
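The steps above can be sketched in Python for the first order model as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the default control parameter `beta=2.0`, the fixed `m=1` (the paper decreases m gradually), the iteration cap, and the stopping tolerance `eps` are all choices made for this sketch. The objective is taken as D_max minus |X'X|^(1/p), the change in the objective is taken as trial value minus current value so that the acceptance probability is at most 1, and `d_max` is assumed not to be understated.

```python
import math
import random

def det(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [list(r) for r in matrix]
    n, d = len(a), 1.0
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(a[r][c]))
        if abs(a[piv][c]) < 1e-12:
            return 0.0
        if piv != c:
            a[c], a[piv] = a[piv], a[c]
            d = -d
        d *= a[c][c]
        for r in range(c + 1, n):
            f = a[r][c] / a[c][c]
            for k in range(c, n):
                a[r][k] -= f * a[c][k]
    return d

def phi(design, d_max):
    """Objective D_max - |X'X|^(1/p) for the first order model."""
    rows = [[1.0] + [float(v) for v in x] for x in design]
    p = len(rows[0])
    xtx = [[sum(r[a] * r[b] for r in rows) for b in range(p)] for a in range(p)]
    return d_max - max(det(xtx), 0.0) ** (1.0 / p)

def gsa_finite(n, q, d_max, beta=2.0, m=1, eps=1e-9, max_iter=20000, seed=0):
    rng = random.Random(seed)
    # Step 1: random starting design on the vertices of the hypercube.
    x0 = [[rng.choice((-1, 1)) for _ in range(q)] for _ in range(n)]
    f0 = phi(x0, d_max)                       # step 2
    for _ in range(max_iter):
        if abs(f0) <= eps:                    # stopping test of steps 2 and 4
            break
        x = [row[:] for row in x0]            # step 3: flip m coordinates
        for idx in rng.sample(range(n * q), m):
            i, j = divmod(idx, q)
            x[i][j] = -x[i][j]
        f = phi(x, d_max)                     # step 4
        if f <= f0:                           # step 5: accept improvement
            x0, f0 = x, f
        else:                                 # step 6: accept with probability p
            p_accept = math.exp(-beta * (f - f0) / f0)
            if rng.random() < p_accept:
                x0, f0 = x, f
    return x0, f0

# 4-point design, 2 factors: the prior estimate D_max = 4 is attained
# when X'X = 4I, i.e. by the 2^2 factorial.
design, value = gsa_finite(n=4, q=2, d_max=4.0)
print(sorted(map(tuple, design)), value)
```

Note that the acceptance probability depends on the current objective value in the denominator, so detrimental steps become very unlikely once the objective is close to zero; this is the behavior the text attributes to the generalized method.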

Convex  Design  Spaces 

We consider the design space most often
used in experimentation, the q-dimensional
hypercube defined by

$-1 \le x_{ij} \le 1$; $i = 1,2,\ldots,n$, $j = 1,2,\ldots,q$.

Since many of the optimal design points
occur at the vertices of the design space, the
algorithm performed better if the constraint
set on the design space was eliminated. A
useful transformation described by Box (1966)
and used by Atkinson (1969) for D-optimal
design computations is

$x_{ij} = \sin y_{ij}$, for $i = 1,2,\ldots,n$ and
$j = 1,2,\ldots,q$.

Then for all values of $y_{ij}$, $|x_{ij}| \le 1$.

A trial design matrix is determined by
perturbing each transformed coordinate by the
amount $\Delta m\, v_{ij}$, where $v_{ij}$ is a random direction
in nq-dimensional space and $\Delta m$ is the step
size. The trial design is accepted with
probability 1 if the value of the objective
function is decreased, and accepted with
probability $p = \exp\{-\beta\,\Delta\phi(X)/\phi(X_0)\}$ if the
value is increased. The values selected for $\Delta m$
and $\beta$ depend on particular problem
characteristics and are found by
experimentation. The value of $\Delta m$ is decreased
gradually during execution of the algorithm to
refine the design as the global minimum is
approached.

The algorithm for a convex design space
follows:

1. Generate a random starting design, $X_0$.

2. Calculate $\phi(X_0)$. If $|\phi(X_0)| \le \epsilon$,
go to 9.

3. Let $Y_0 = \arcsin X_0$.

4. Determine an $n \times q$ random direction matrix
V by choosing independent uniform [-1,1]
random variables, $b_{ij}$, and computing the
components of V: $v_{ij} = b_{ij} / (\sum\sum b_{ij}^2)^{1/2}$.

5. Let $Y = Y_0 + \Delta m\, V$; let $X = \sin Y$.

6. Calculate the new value of the objective
function $\phi(X)$; let $\Delta\phi(X) = \phi(X) - \phi(X_0)$.
If $|\phi(X)| \le \epsilon$, go to 9.

7. If $\phi(X) \le \phi(X_0)$, let $X_0 = X$
and $\phi(X_0) = \phi(X)$. Go to 3.

8. If $\phi(X) > \phi(X_0)$, let
$p = \exp\{-\beta\,\Delta\phi(X)/\phi(X_0)\}$ and generate a
uniform [0,1] random variable, u.

a. If $u \ge p$, go to 4.

b. If $u < p$, let $X_0 = X$ and
$\phi(X_0) = \phi(X)$. Go to 3.

9. Stop.
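The part of this procedure that differs from the finite case is the generation of a trial design through the sine transformation (steps 3 to 5). A minimal Python sketch of that single step follows; the function name `trial_design`, the argument name `step` for the step size, and the example starting design are assumptions of this sketch.

```python
import math
import random

def trial_design(x0, step, rng):
    """One trial move on the hypercube [-1, 1]^q via x = sin(y)."""
    n, q = len(x0), len(x0[0])
    # Step 3: transform the current design, Y0 = arcsin X0.
    y0 = [[math.asin(v) for v in row] for row in x0]
    # Step 4: random direction of unit length in nq-dimensional space.
    b = [[rng.uniform(-1.0, 1.0) for _ in range(q)] for _ in range(n)]
    norm = math.sqrt(sum(v * v for row in b for v in row))
    v = [[bij / norm for bij in row] for row in b]
    # Step 5: move in the transformed space and map back; the sine keeps
    # every coordinate inside [-1, 1] without explicit constraints.
    x = [[math.sin(y0[i][j] + step * v[i][j]) for j in range(q)]
         for i in range(n)]
    return x, v

rng = random.Random(1)
start = [[1.0, -1.0], [0.5, 0.0], [-0.25, 0.75]]   # hypothetical 3-point design
trial, direction = trial_design(start, step=0.3, rng=rng)
print(trial)
```

The acceptance test applied to the trial design is the same as in the finite case, so only the move generator needs to change between the two settings.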

Results 

The algorithms were executed on the Cray-2
supercomputer at the University of Minnesota
using test problems for first and second order
response surface models on both finite and
convex design spaces. A detailed account of
the empirical results is contained in Meyer
and Nachtsheim (1988).

Conclusions

The generalized simulated annealing
algorithm was used to construct D-optimal
designs on both finite and convex design
spaces in an attempt to overcome the problems
of premature convergence and/or computer
infeasibility with high dimensions encountered
with current algorithms. For the finite
design space, the only algorithm currently
available for construction of globally optimal
designs is Welch's (1982) branch-and-bound
search, which is not recommended if N > 30.
Our results suggest that the generalized
simulated annealing algorithm can be simply
implemented and cheaply used to search for
globally optimal designs on as many as N =
1000 candidate points. We have demonstrated
its utility for first order response surface
models having up to 10 factors, and for second
order models with as many as 3 factors. The
cost, however, is that D-optimality is not
guaranteed.

Conversely, our results are not encouraging
on the present convex design spaces.
Considerably more computer time was required
for the construction of D-optimal designs on
convex spaces than on finite spaces.
A SIMULATED ANNEALING
APPROACH TO MAPPING DNA

LARRY GOLDSTEIN  MICHAEL S. WATERMAN
UNIVERSITY  OF  SOUTHERN  CALIFORNIA 


Summary 

The double digest mapping problem that arises
in molecular biology is an NP complete problem
that shares similarity with both the travelling
salesman problem and the partition problem.
Sequences of DNA are cut at short specific patterns
by one of two restriction enzymes singly and then
by both in combination. From the set of resulting
lengths, one is required to construct a map
showing the location of cleavage sites. In order
to implement the simulated annealing algorithm,
one must define appropriate neighborhoods on the
configuration space, in this case a pair of
permutations, and an energy function to minimize that
attains its global minimum value at the true
solution. We study the performance of the simulated
annealing algorithm for the double digest
problem with a particular energy function and a
neighborhood structure based on a deterministic
procedure for the travelling salesman problem.


1  Introduction 

The simulated annealing algorithm has shown
promise on a variety of combinatorially hard
problems, such as the NP complete travelling
salesman problem [1]. Below, we study the
performance of an implementation of the simulated
annealing algorithm on the double digest mapping
problem, an NP complete problem arising
in molecular biology. The double digest mapping
problem can be stated roughly as follows. A
restriction enzyme cuts a strand of DNA, regarded
as a linear sequence over the four letter alphabet
{A,T,C,G}, at all occurrences of a pattern
specific to that enzyme; these patterns are short, [...]

this area see [13], [12], [3], [11], [2], and [14].

It is perhaps not surprising that the double
digest problem is a member of the class of NP
complete problems, a class of problems for which
no polynomial time algorithms are known. This
may be demonstrated by showing that a special
case of the double digest problem is an NP
complete problem known as the partition problem.
Hence, the double digest problem is at least as
hard as the partition problem, and itself belongs
to the class of NP hard problems.

Given that it is therefore unlikely one will
find a fast, polynomial time algorithm to solve
the double digest problem, one may turn to the
simulated annealing algorithm, a recent
probabilistic procedure that has enjoyed some success
on combinatorially hard problems of this nature.

This paper is a report on the application of
the simulated annealing algorithm to the double
digest problem. We first give a mathematical
description of the double digest problem. Next,
we show that the double digest problem is an
NP complete problem. In the section that
follows, we give a description of the simulated
annealing algorithm in general, and state how it
may be applied to the problem at hand. Lastly,
we conclude with some remarks on the effectiveness
of the procedure in this instance and on the
nonuniqueness of solutions to the double digest
problem in general.

2  Description  of  the  Double 
Digest  problem 

The double digest problem can be stated as
follows. A restriction enzyme cuts a piece of DNA
at all occurrences of a short specific
pattern and the lengths of the resulting
fragments are recorded. In the double digest problem
we have as data the list of fragment lengths when
each enzyme is used singly, say

$A = \{a_i : 1 \le i \le n\}$

from the first digest and

$B = \{b_i : 1 \le i \le m\}$

from the second digest, as well as a list of double
digest fragment lengths when the restriction
enzymes are used in combination and the DNA is
cut at all occurrences specific to both patterns,
say

$C = \{c_i : 1 \le i \le n_{1,2}\}$;

only length information is retained. In general
A, B and C will be multisets; that is, there may
be values of fragment lengths that occur more
than once. We adopt the convention that the
sets A, B, and C are ordered, that is, $a_i \le a_j$ for
$i \le j$, and likewise for the sets B and C. Of
course

$\sum_{1 \le i \le n} a_i = \sum_{1 \le i \le m} b_i = \sum_{1 \le i \le n_{1,2}} c_i$,

since we are assuming that fragment lengths are
measured in number of letters with no errors.
Given the above data the problem is to find
orderings for the sets A and B such that the double
digest implied by these orderings is, in a sense
made precise below, C.

We may express the double digest problem
more precisely as follows. Let $S_k$ denote the set
of all permutations on $k$ objects. For $\sigma \in S_n$,
$\mu \in S_m$, call $(\sigma,\mu)$ a configuration. By ordering A and
B according to $\sigma$ and $\mu$ respectively, we obtain
the set of locations of cut sites

$S = \{s : s = \sum_{1 \le j \le r} a_{\sigma(j)} \text{ or } s = \sum_{1 \le j \le t} b_{\mu(j)};\ 0 \le r \le n,\ 0 \le t \le m\}$.

Since we want to record only the location of
cut sites, the set S is not allowed repetitions, that
is, S is not a multiset. Now label the elements of
S such that

$S = \{s_j : 0 \le j \le n_{1,2}\}$, with $s_i \le s_j$ for $i \le j$.


The double digest implied by the configuration
$(\sigma,\mu)$ can now be defined as the lengths that
result when the fragment is cut at the locations
indicated by S, that is, by

$C(\sigma,\mu) = \{c_i(\sigma,\mu) : c_i(\sigma,\mu) = s_i - s_{i-1} \text{ for some } 1 \le i \le n_{1,2}\}$

where we assume as usual that the set is ordered
in the index i. The problem then is to find a
configuration $(\sigma,\mu)$ such that $C = C(\sigma,\mu)$.

We note for future reference that the function
$f$ on the configuration space given by

attains its global minimum value of zero at the
configuration $(\sigma,\mu)$ if and only if this configuration
is a solution to the double digest problem.
Hence, we may consider an equivalent formulation
of the double digest problem: find where $f$
attains its global minimum value of zero.
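The construction of the implied double digest can be made concrete with a short Python sketch. The function `implied_double_digest` follows the definition of the cut site set S given in the text. The `energy` function is only a hypothetical stand-in, assumed here for illustration: it is one of many functions that vanish exactly when the implied digest matches C, and it should not be read as the energy function used by the authors. Permutations are written as 0-based index tuples, and the example data are invented.

```python
from itertools import accumulate

def implied_double_digest(a, b, sigma, mu):
    """Sorted fragment lengths C(sigma, mu) implied by ordering A and B."""
    sites = {0}
    sites.update(accumulate(a[i] for i in sigma))   # cut sites of first enzyme
    sites.update(accumulate(b[i] for i in mu))      # cut sites of second enzyme
    s = sorted(sites)                               # S is a set, not a multiset
    return sorted(s[i] - s[i - 1] for i in range(1, len(s)))

def energy(a, b, c, sigma, mu):
    """Hypothetical energy: zero if and only if C(sigma, mu) equals C."""
    implied = implied_double_digest(a, b, sigma, mu)
    if len(implied) != len(c):
        return float("inf")
    return sum((x - y) ** 2 / y for x, y in zip(implied, sorted(c)))

# Invented example: total length 10.
A, B, C = [2, 3, 5], [4, 6], [1, 2, 2, 5]
print(energy(A, B, C, (0, 1, 2), (0, 1)))   # a configuration that reproduces C
print(energy(A, B, C, (0, 1, 2), (1, 0)))   # a configuration that does not
```

Evaluating a configuration takes one pass over A and B plus a sort of at most n + m + 1 cut sites, which is what makes the energy cheap enough to recompute at every annealing step.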


3  Computational  Complex¬ 
ity  of  the  Double  Digest 
Problem 

We demonstrate below that the double digest
problem is NP complete. It is clear that the double
digest problem (DDP) as described above is
in the class NP, as a nondeterministic algorithm
need only guess a configuration $(\sigma,\mu)$ and check
in polynomial time if $C = C(\sigma,\mu)$. The number
of steps to check this is in fact linear. To show
that DDP is NP complete we transform the
partition problem to DDP. In the partition problem,
known to be NP complete, we are given a finite
set A, say $|A| = n$, and a positive integer $s(a)$ for
each $a \in A$ and wish to determine whether there
exists a subset $A' \subset A$ such that

$\sum_{a \in A'} s(a) = \sum_{a \in A - A'} s(a)$.

If $\sum_{a \in A} s(a) = J$ is not divisible by two, there
can be no such subset $A'$; else, consider as input
to problem DDP the data

$A = \{s(a_k) : 1 \le k \le n\}$,

$B = \{J/2, J/2\}$ and set $C = A$.

It is clear that any solution to problem DDP
with this data yields a solution to the partition
problem through the order of the implied digest
C.
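The transformation can be checked on small instances by brute force. In the Python sketch below, a partition instance is turned into the DDP data described above (B consisting of two fragments of length J/2, and C equal to A), and every ordering of A is tried; when the implied digest equals C, the cut at J/2 must coincide with a cut site of A, so a prefix of the ordering sums to J/2. The function names are assumptions of this sketch, and the exhaustive search is for illustration only since it takes factorial time.

```python
from itertools import accumulate, permutations

def implied_double_digest(a, b, sigma, mu):
    """Sorted fragment lengths implied by ordering A and B."""
    sites = {0}
    sites.update(accumulate(a[i] for i in sigma))
    sites.update(accumulate(b[i] for i in mu))
    s = sorted(sites)
    return sorted(s[i] - s[i - 1] for i in range(1, len(s)))

def partition_via_ddp(sizes):
    """Return indices of a subset summing to J/2, or None if none exists."""
    total = sum(sizes)
    if total % 2:
        return None                      # J not divisible by two
    half = total // 2
    a, b, c = list(sizes), [half, half], sorted(sizes)
    for sigma in permutations(range(len(a))):
        if implied_double_digest(a, b, sigma, (0, 1)) == c:
            # The cut at J/2 falls on a cut site of A: read off the prefix.
            subset, running = [], 0
            for i in sigma:
                if running == half:
                    break
                subset.append(i)
                running += a[i]
            return subset
    return None

sizes = [3, 1, 1, 2, 2, 1]
subset = partition_via_ddp(sizes)
print(subset, [sizes[i] for i in subset])
```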

4  The  Simulated  Annealing 
Algorithm 

We  now  give  a  description  of  the  simulated  an¬ 
nealing  algorithm.  The  algorithm  is  based  on 
the  following  analogy  with  statistical  mechanics. 
To  a  given  physical  system,  there  corresponds  a 
function  /  that  assigns  to  the  state  of  that  sys¬ 
tem  its  energy.  The  algorithm  mimics  the  behav¬ 
ior  of  such  a  physical  system  moving  from  state 
to  state  in  order  to  minimize  energy. 

Specifically, let $V$ be a finite set of elements,
and $f$ a function that assigns a real number to
each element of $V$. The elements of $V$ represent
the state of the system, and we think of $f(v)$ as
the energy of the system when in state $v$.

In statistical mechanics, the Gibbs distribution
gives the probability of finding the system in
a particular state. Introducing the temperature
parameter $T$, we write the Gibbs distribution as

$\pi_T(v) = \exp\{-f(v)/T\}/Z_T$,

where $Z_T$, the partition function, is chosen such
that

$\sum_{v \in V} \pi_T(v) = 1$.

For large values of $T$ the distribution tends to be
uniform over $V$, while for small values of $T$ the
favorable elements of $V$, that is, those elements of
$V$ for which $f(v)$ is small, are weighted with large
probability. Therefore, a probabilistic solution
to the problem of locating an element $v \in V$
for which $f(v)$ is minimized is given by sampling
from the distribution $\pi_T$ for small $T > 0$.
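The effect of the temperature on the Gibbs distribution is easy to see numerically. The following Python sketch evaluates the distribution on a small, invented set of energies at a high and a low temperature; the function name `gibbs` and the example values are assumptions of this sketch.

```python
import math

def gibbs(energies, T):
    """Gibbs distribution exp(-f(v)/T) / Z_T over a finite state space."""
    weights = [math.exp(-f / T) for f in energies]
    z = sum(weights)                    # partition function Z_T
    return [w / z for w in weights]

energies = [0.0, 1.0, 2.0, 5.0]         # invented energies f(v)
hot = gibbs(energies, T=100.0)          # close to uniform
cold = gibbs(energies, T=0.1)           # concentrated on the minimum
print(hot)
print(cold)
```

At the high temperature the four probabilities are nearly equal, while at the low temperature almost all of the mass sits on the state of minimum energy, which is the behavior the sampling argument relies on.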

One way this may be achieved is to simulate
a Markov chain $\{X_n\}_{n \ge 0}$ with state space $V$ that
has $\pi_T$ as its stationary distribution and let it
approach equilibrium. It is possible to write down
an explicit formula for the transition law of such
a Markov chain.

Simulating a Markov chain of this type with
the parameter $T$ fixed was proposed by Metropolis
et al. [10]. One may observe that in the
context of minimization, the smaller the value of $T$
the higher the probability of finding the state of
global minimum energy. Kirkpatrick et al. [8]
introduced the idea of cooling the system in the
hope that in the limit one would obtain the
distribution $\pi_0$ that puts mass one uniformly over
the states of minimum energy. In this way the
algorithm resembles the physical process of
annealing, or cooling, a physical system. As in
the physical analog, the system may be cooled
too rapidly and become trapped in a state
corresponding to a local energy minimum; Geman
and Geman [6] (see also Hajek [7]) showed that if
at stage $n$ in the algorithm one cools the system
with a sequence of temperatures $T_n$, where $T_n \downarrow 0$
and $T_n \ge c/\log(n)$ with $c$ a constant that
depends on $f$, then the state of the Markov chain
converges in distribution to $\pi_0$.

In order to simulate the Markov chain it is required to specify,
for each possible state $u$, the collection of states $N_u$ to
which transitions are to be allowed. We call such a collection of
states the neighbors of a given state. Of course, we must require
each state to be reachable from any other state through a
sequence of neighbors.

Our neighborhood structure was motivated by a neighborhood
structure used in a simulated annealing algorithm for the
travelling salesman problem [1], which in turn was based on a
deterministic procedure for that particular problem [9]. In the
travelling salesman problem one is required to find the tour of
shortest length that visits $n$ given cities in the plane. Hence,
the configuration space for the travelling salesman problem is
the set of permutations, where a particular permutation gives the
order in which cities are to be visited. For the double digest
problem, as described in section 2, a configuration is a pair of
permutations.

We now describe a neighborhood structure for the travelling
salesman problem ([1], [9]). If, for a given permutation, or
tour, $\sigma$ we imagine links connecting cities in the tour, we
say that the tour $\sigma$ is $k$-optimal, or $k$-opt, for
$1 \le k \le n$, if among all tours that can be obtained from
$\sigma$ by breaking at most $k$ links, the tour given by
$\sigma$ is the shortest. Thus, every tour is 1-opt and only the
true best tours are $n$-opt. We define a neighborhood system
using the concept of 2-optimality. For a given tour
$\sigma = (i_1, i_2, \ldots, i_n)$ visiting city $i_1$, then
$i_2$ and so on, let the neighborhood of $\sigma$ be defined by

$$N(\sigma) = \{\tau \in S_n : \tau = (i_1, i_2, \ldots, i_{j-1}, i_k, i_{k-1}, \ldots, i_{j+1}, i_j, i_{k+1}, \ldots, i_n) \text{ for some } 1 \le j < k \le n\}.$$

It is not difficult to see that this notion of neighborhood
allows one to transition from any state to any other state
through a sequence of neighbors.
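The neighborhood built from segment inversions can be enumerated directly for a small tour. The sketch below is illustrative only: the helper names `invert` and `neighborhood` are assumptions, and positions are 0-based, whereas the text indexes cities from 1.

```python
from itertools import combinations

def invert(tour, j, k):
    """Reverse positions j..k (0-based, inclusive) of the tour."""
    return tour[:j] + tour[j:k + 1][::-1] + tour[k + 1:]

def neighborhood(tour):
    """All tours obtained from `tour` by one segment inversion with j < k."""
    n = len(tour)
    return [invert(tour, j, k) for j, k in combinations(range(n), 2)]

tour = (1, 2, 3, 4, 5)
nbrs = neighborhood(tour)
example = invert(tour, 1, 3)   # reverses the segment (2, 3, 4)
```

A tour on n cities has n(n-1)/2 such neighbors, one per pair j < k, and each neighbor is again a permutation of the same cities.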

For the double digest problem our configuration space is a pair
of permutations. Accordingly, for this problem we may define a
neighborhood of a configuration $(\sigma, \mu)$ by

$$N((\sigma, \mu)) = \{(\tau, \mu) : \tau \in N(\sigma)\} \cup \{(\sigma, \nu) : \nu \in N(\mu)\},$$

where $N(\sigma)$ and $N(\mu)$ are the neighborhoods used in the
discussion of the travelling salesman problem above.

We conclude this section with an explicit description of the
simulated annealing algorithm.

Let the initial state $u_1$ be an arbitrary element of the
configuration space $S$. At stage $n$, let us say the state of
the system is $u = (\sigma, \mu)$. Set the temperature $T_n$
according to the cooling schedule. Select a neighbor $v \in N_u$
uniformly from $N_u$. For the case at hand, this selection may be
done in the following manner. Choose to invert either $\sigma$ or
$\mu$, each with equal probability. Say $\sigma$ is chosen. We
now randomly invert a portion of the “tour” given by $\sigma$ in
such a way that all inversions are equally likely, yielding a new
“tour”, say $\tau$. Let $v = (\tau, \mu)$. Compute
$\Delta = f(v) - f(u)$. If $\Delta \le 0$ then accept $v$ as the
new state of the chain for iteration $n + 1$. If $\Delta > 0$,
accept $v$ as the new state of the chain with probability
$p = \exp\{-\Delta/T_n\}$ and keep $u$ as the new state for
iteration $n + 1$ with probability $1 - p$.
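The accept/reject loop just described can be sketched as follows. Several pieces are assumptions made for illustration and are not taken from the paper: the logarithmic schedule `T = c / log(n + 1)`, the function names, and the toy energy `toy_energy`, which simply stands in for the double digest objective $f$.

```python
import math
import random

def random_inversion(perm, rng):
    """Reverse a uniformly chosen segment perm[j..k] with j < k."""
    j, k = sorted(rng.sample(range(len(perm)), 2))
    return perm[:j] + perm[j:k + 1][::-1] + perm[k + 1:]

def anneal(f, sigma, mu, n_iter, c=1.0, seed=0):
    """Simulated annealing over pairs of permutations (sigma, mu)."""
    rng = random.Random(seed)
    u = (tuple(sigma), tuple(mu))
    fu = f(*u)
    best, best_f = u, fu
    for n in range(1, n_iter + 1):
        T = c / math.log(n + 1)              # assumed cooling schedule
        if rng.random() < 0.5:               # invert sigma or mu, equally likely
            v = (random_inversion(u[0], rng), u[1])
        else:
            v = (u[0], random_inversion(u[1], rng))
        fv = f(*v)
        delta = fv - fu
        # Accept downhill moves always, uphill moves with prob. exp(-delta/T).
        if delta <= 0 or rng.random() < math.exp(-delta / T):
            u, fu = v, fv
            if fu < best_f:
                best, best_f = u, fu
    return best, best_f

def toy_energy(sigma, mu):
    """Hypothetical energy: total displacement from the identity orderings."""
    return (sum(abs(s - i) for i, s in enumerate(sigma))
            + sum(abs(m - i) for i, m in enumerate(mu)))

start = ((3, 1, 4, 0, 2), (2, 0, 1))
best, best_f = anneal(toy_energy, start[0], start[1], n_iter=2000)
```

The sketch also records the best configuration seen so far, which is a bookkeeping convenience rather than part of the chain itself.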

5 Performance of the Algorithm

With the above framework in place, the simulated annealing
algorithm was run on both simulated and actual mapping problems.

The performance of the algorithm on large simulated problems led
us to suspect that in general solutions to the double digest
mapping problem are not unique. In fact, under a certain
probability model, the number of solutions to the double digest
mapping problem increases exponentially in the length of the
segment [5]. The performance of the algorithm for these problems
is therefore confounded by the large number of exact solutions.

For mapping the bacteriophage lambda, 'l.^„3G0 base pairs in
length, with the restriction enzymes BamHI and EcoRI, each of
which cuts lambda into 6 pieces of distinct lengths for a problem
of size $6!\,6! = 518{,}400$, the algorithm was able to find the
correct solution in 2't,702, f)8q."), and .'i()70 iterations in
runs from three different initial conditions. It is interesting
to note that the solution to this actual problem was in fact
unique. Further details may be found in [5].

Abstract

One can easily conjecture that we humans have imposed sequential
solutions onto most problems because such solutions are a better
match to our physical architecture, but we propose that there are
parallel solutions to many problems and that these are better if
they can be matched to our computer architectures. The discovery
of problems involving parallelism in the many and diverse
disciplines which are the subject of current research efforts has
been a simple matter; however, the development of methods which
discover the parallelism possible in solutions to a problem is
not a simple matter and is the focus of this research. This paper
will describe the model and discuss the current research efforts
in terms of academic contributions and the strengths gained
through the interdisciplinary group approach to problem solving.

At Kansas State University a group of people from three
disciplines in two colleges has been formed to provide a critical
mass of researchers and to create a broader base of knowledge
from which to draw to find an architecture-free model which can
be used to express, in a natural way, the potential concurrency
in problem solutions. A partially defined model has been
developed, based upon a conditioned dataflow which incorporates
the concepts of control flow based on dataflow, of the
description of an action at any level of detail with subsequent
further refinement if desired, of repetition based upon
partitions of data aggregates, of single assignment of values to
uniquely identify each incarnation of data objects, and of
partial computation, i.e., computation which can proceed until a
needed unavailable datum is encountered. The group has four major
foci to their work: 1) continuing development of the theoretical
foundation of the model, led by the computer scientists; 2) use
of the model to discover paradigm parallelism models for
particular problems at the small and the large granularity levels
of detail, led by the statistician and engineers; 3) the
development of methods of determining the best fit of the
discovered parallelism to existing architectures, led by the
statistician and engineers; 4) the continued implementation of a
prototype on a distributed network of processors, led by the
computer scientists. All members have contributed to all phases.

The current status of our work includes a model which has been
shown to contain a core of statements which always describe
determinate problem solutions for atomic data types. A prototype
is being used to study problem solutions where the granularity of
the parallelism is small. Ongoing research work involves
providing the theoretical basis for temporally partitioned data
aggregates, the inclusion in the prototype of partial computation
and limited data structures, and the development of models of
existing architectures using the model for the current
multiprocessor architectures.

1.  Introduction 

Traditionally, computing machine design and the choice of problem
solution have been predicated upon the sequential expression of
computation. The advent of multiple processor architectures and
computer networks requires a different approach to problem
expression to fully utilize the available computational power.
The primary component of this approach is the division of a
problem solution into computational units and the ordering of the
execution of these divisions. In this paper, a model/method will
be developed which is based on the examination of the flow of
data and on the aggregation of data to discover parallelism in
numerical algorithms. This method of parallel computation seems
to hold great promise for statistical application (Lafaye de
Micheaux, 1984).

Unlike much of the current research in parallel algorithms for
statistical and numerical linear algebra problems, the
model/method developed here is architecture independent [Heller
1978, O'Leary 1985, 1980, Gokhale 1987]. Through the fundamental
ideas of dataflow computation [Dennis 1972] and spatial and
temporal partitioning of data structures into computationally
independent units [Unger 1978], the inherent parallelism in a
problem solution can be specified without the traditional
concerns of communication, synchronization, data sharing and
physical architecture [McBride 1983]. The architecture-free
approach to parallel computing is prompted by the idea that only
after the inherent parallelism of a numerical method is expressed
as a grouping of the data (not necessarily limited to the data
aggregates) does the architecture become a consideration.
Jamieson (1987) calls this the virtual algorithm for a problem
solution approach. Different architectures will give rise to
different sequencings of the independent computational units from
this virtual algorithm.

This  paper  is  divided  into  three  parts.  Section  2  gives  a 
description  of  the  basic  model/method  we  have  developed. 
This  section  is  followed  by  a  discussion  of  the  concepts  of 
data  and  procedural  abstraction.  Section  4  deals  with  the 
notion  of  the  existence  of  a  virtual  algorithm  with  a 
motivating  example. 

2.  Basic  Concurrency  Model/Method 

In  this  section  a  mathematically  based  model  which  can 
be  used  for  concurrent  computation  is  presented.  This 
concurrency  model/method  is  built  upon  the  concept  of 
representing  both  data  and  action  as  objects  [Unger  1988b]. 
The  fundamental  principles  of  sequencing  these  objects 
will  be  illustrated  as  well  as  how  this  sequencing  can  be 
altered  through  the  use  of  predicates. 

Objects representing action can be aggregated into collections of
objects resulting in a more abstract action object or
disaggregated into several less abstract action objects. Each
action object can be represented as a 5-tuple (s, m, a, r, t)
where s is a boolean predicate whose truth value determines when
the action will be eligible for execution, m is a list of
materials (input objects), a is the name or designator of the
action, r is a list of results (output objects), and t is a
boolean predicate whose truth value determines if and when the
action should terminate prior to completion.
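One way to picture the 5-tuple is as a small record type. The sketch below is an illustration only, not the authors' prototype: the class name `ActionObject`, the `Env` dictionary of available data objects, and the `eligible` helper are all assumptions introduced here.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple

Env = Dict[str, float]  # designator -> value, for data objects that exist

def always(env: Env) -> bool:
    return True

def never(env: Env) -> bool:
    return False

@dataclass(frozen=True)
class ActionObject:
    s: Callable[[Env], bool]   # stimulation predicate
    m: Tuple[str, ...]         # materials (input objects)
    a: str                     # name or designator of the action
    r: Tuple[str, ...]         # results (output objects)
    t: Callable[[Env], bool]   # termination predicate

    def eligible(self, env: Env) -> bool:
        """Eligible once every material is available and s holds."""
        return all(name in env for name in self.m) and self.s(env)

# Sqr(a; temp1) with no conditions attached.
sqr_a = ActionObject(always, ("a",), "Sqr", ("temp1",), never)
ready_now = sqr_a.eligible({"a": 3.0})
ready_without_input = sqr_a.eligible({})
```

Checking the materials before evaluating `s` means a predicate never sees an environment in which its inputs are missing.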


For illustrative purposes, the following syntactic form for an
action object will be used:

[s] a(m; r) [t] ,

where the elements of m and r are separated by commas. For
example, computing the length of the hypotenuse of a right
triangle using the Pythagorean Theorem could be expressed as
shown in Figure 1.


Model Objects                 Interpretation
Sqrt(temp3; c)                $\sqrt{a^2 + b^2} \to c$
Add(temp1, temp2; temp3)      $a^2 + b^2$
Sqr(a; temp1)                 $a^2$
Sqr(b; temp2)                 $b^2$

Figure 1: Hypotenuse Computation


The model is data driven. This means that the time at which an
action is first eligible for execution is when all of the
elements of m, the materials list, are available. Figure 2 gives
the dataflow diagram [Petersen 1977, Karp 1966, McBride 1987, Noe
1973] for the hypotenuse computation of Figure 1. It is drawn
such that an action appears at the first horizontal level at
which it is eligible for computation. At level 1 of Figure 2,
there are two actions which can be computed concurrently; this
represents the only inherent parallel computation in the example.
Note that since the sequencing of computation is driven by the
availability of the materials, or inputs, the order in which the
syntactical statements are listed is immaterial. This allows each
problem solver the freedom to conceive the problem solution in
the most natural way for them.
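The data-driven rule can be sketched as a loop that repeatedly fires every action whose materials are available. This is a minimal illustration under stated assumptions: the tuple layout `(name, materials, results, function)`, the names `Sqr_a` and `Sqr_b`, and the function `run_data_driven` are hypothetical and are not the prototype described in the paper.

```python
import math

# The four action objects of Figure 1, deliberately listed "out of order".
ACTIONS = [
    ("Sqrt", ("temp3",), ("c",), lambda t3: math.sqrt(t3)),
    ("Add", ("temp1", "temp2"), ("temp3",), lambda x, y: x + y),
    ("Sqr_a", ("a",), ("temp1",), lambda a: a * a),
    ("Sqr_b", ("b",), ("temp2",), lambda b: b * b),
]

def run_data_driven(actions, inputs):
    """Fire actions level by level as their materials become available."""
    env = dict(inputs)
    pending = list(actions)
    levels = []
    while pending:
        ready = [act for act in pending if all(m in env for m in act[1])]
        if not ready:
            raise RuntimeError("deadlock: no action has all its materials")
        # Actions in one level are independent, so they could run concurrently.
        produced = {}
        for name, materials, results, fn in ready:
            produced[results[0]] = fn(*(env[m] for m in materials))
        env.update(produced)
        levels.append([act[0] for act in ready])
        pending = [act for act in pending if act not in ready]
    return env, levels

env, levels = run_data_driven(ACTIONS, {"a": 3.0, "b": 4.0})
```

The first level contains the two squaring actions, matching the single point of inherent parallelism noted in the text, even though `Sqrt` was listed first.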


Figure 2: Data Flow for Hypotenuse Computation

Data objects in the model have two important components, the
designator and the corporality type. The designator contains an
arbitrary name assigned by the problem solver. Corporality, or
the length of existence of an object, provides the capabilities
to assure the determinacy of the problem solution results. Two
corporality types of the model provide the concept of the single
assignment of a value to an object [Comte 1976]. If the
corporality type of a data object is static, only one value may
ever be assigned to that data object. The default corporality of
a data object is dynamic. When the corporality type of a data
object is dynamic, the model adds a sequence indicator to the
designator. This can be envisioned as the data object having a
series of incarnations, each distinguished by the sequence
indicator (e.g., $x_i, x_{i+1}, x_{i+2}, \ldots$). Objects with
corporality type of dynamic can be referenced by their designator
and sequence indicator or by their designator alone. If a dynamic
object is referenced without a sequence indicator, the latest
available incarnation of the object is retrieved. An additional
corporality type of fluid is also defined by the model. A data
object with the corporality type of fluid can change value with
no incarnation indicator (like the common implementation of a
variable in current programming languages). Such objects are
currently not allowed in the determinant subset of the model and
will not be discussed further in this paper.

Figure 3 gives the calculation of a Fibonacci sequence of numbers
as it would be described within the model. In this example the
data object designated x has a dynamic type of corporality. The
statements 1 and 2 indicate absolute references to the
incarnations $x_0$ and $x_1$. The statement 3 references the data
object x in a relative fashion, directing that the previous
incarnation is to be added to the current incarnation to form the
next incarnation.


Model Objects             Interpretation

declare x dynamic;
assign(1; x.0)            $x_0 = 1$                     [1]
assign(1; x.1)            $x_1 = 1$                     [2]
add(x.-1, x; x.+1)        $x_{i+1} = x_i + x_{i-1}$     [3]

Figure 3: Use of Data Objects with Dynamic Corporality
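A data object with dynamic corporality can be imitated by a container in which each incarnation is assigned exactly once. The sketch below is an assumption-laden illustration: the class `DynamicObject` and its methods are hypothetical names, and the loop bound of eight steps is arbitrary.

```python
class DynamicObject:
    """A data object whose incarnations x.0, x.1, ... are each assigned once."""

    def __init__(self, name):
        self.name = name
        self._incarnations = {}

    def assign(self, index, value):
        # Single assignment: an existing incarnation may never be overwritten.
        if index in self._incarnations:
            raise ValueError(f"{self.name}.{index} is already assigned")
        self._incarnations[index] = value

    def latest(self):
        """Sequence indicator of the latest available incarnation."""
        return max(self._incarnations)

    def get(self, index=None):
        # A reference without a sequence indicator yields the latest incarnation.
        if index is None:
            index = self.latest()
        return self._incarnations[index]

x = DynamicObject("x")
x.assign(0, 1)                 # statement [1]
x.assign(1, 1)                 # statement [2]
for _ in range(8):             # statement [3], applied repeatedly
    i = x.latest()
    x.assign(i + 1, x.get(i - 1) + x.get())

fib = [x.get(i) for i in range(10)]
```

Because earlier incarnations are never overwritten, any action that reads a particular incarnation always sees the same value, which is what makes the result determinate.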

The granularity of the action and data objects can vary. The
smallest granularity action objects are those that specify
primitive actions (e.g., +, -, *, /). Large granularity action
objects are ones in which considerable detail must be provided in
terms of the composing actions before primitive actions are
specified. The action objects specified in Figure 1 are atomic
actions, hence they have small granularity. If the action objects
in Figure 1 were aggregated together into the action object
Phythagorean(a, b; c), say, then this would be an example of an
object with larger granularity. Syntactically this will be
denoted as shown in Figure 4.

Phythagorean(a, b; c)
{ Sqrt(temp3; c)
  Add(temp1, temp2; temp3)
  Sqr(a; temp1)
  Sqr(b; temp2) } .

Figure 4: Phythagorean Action Object

This aggregate object has the designator Phythagorean and two
input (material) objects, a and b. The use of this aggregate
object and the accompanying dataflow is shown in Figure 5.

Phythagorean(side1, side2; hypothenuse)


Figure  5:  Aggregate  action  object  Phythagorean 

Values for each of the data objects, side1 and side2, are
required for the action object Phythagorean to execute. We can
use an aggregate data object, say A, composed of side1 and side2,
and then the use of the action Phythagorean could be expressed as
shown below.

partition A: side1, side2
Phythagorean(A; hypothenuse)

The use of aggregate data objects allows us to form a partition
of a data structure. For example, consider the pay check
computation (CHECK) for a company with 1000 employees and two
computers. If the input to CHECK is pay_file, which consists of
1000 records, we can aggregate the data into 2 aggregate data
objects, pay1 and pay2, as shown in Figure 6.

partition pay1 : pay_file, records 1-500,
          pay2 : pay_file, records 501-1000,
the call for action would be:

CHECK (pay1)
CHECK (pay2)

Figure 6: Aggregate data object based on partitions
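The two-way partition of the pay file can be sketched with a small worker pool standing in for the two computers. Everything concrete here is assumed for illustration: the record layout `(hours, rate)`, the body of `check`, and the use of threads are not taken from the paper.

```python
from concurrent.futures import ThreadPoolExecutor

def check(records):
    """Hypothetical CHECK action: gross pay for each (hours, rate) record."""
    return [hours * rate for hours, rate in records]

# A made-up pay file of 1000 records.
pay_file = [(40, 10 + (i % 5)) for i in range(1000)]

# partition pay1 : records 1-500, pay2 : records 501-1000
pay1, pay2 = pay_file[:500], pay_file[500:]

# Each aggregate is independent of the other, so the two calls may overlap.
with ThreadPoolExecutor(max_workers=2) as pool:
    part1, part2 = pool.map(check, [pay1, pay2])

checks = part1 + part2
```

Since each record is processed independently, the concatenated result is the same as running `check` on the whole file sequentially.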

Predicates are used to govern when and if an action object is
started or when an action object is terminated prior to
completion of its specified task. Predicates which appear on a
defined action object are termed internal conditions. Predicates
used when an action object is requested (called into use) are
termed external conditions. For example, an internal stimulation
condition of [a = b] placed on the Phythagorean action object
defined above limits the hypotenuse computation to isosceles
triangles. Any attempted use of the Phythagorean object on
non-isosceles triangles would result in no action. The action
object Averages given below illustrates the use of an external
stimulation condition.

[count > 0] Averages(count, occurrences; a1, a2, a3)

where the detail of Averages is:

Averages(count, occurrences; a1, a2, a3)
{ Mean(count, occurrences; a1)
  Median(count, occurrences; a2)
  Mode(count, occurrences; a3) }

The action object Averages will be executed only if count (number
of occurrences or data points) is greater than zero.

When executed, this action object returns three measures of
central tendency: the mean, the median, and the mode.


If  one  wished  to  merely  have  a  measure  of  central  ten¬ 
dency  but  did  not  care  which  one  or  did  not  want  all 
three,  internal  termination  conditions  could  be  used  on  the 
object  Averages  as  shown  below. 

[count > 0] Averages(count, occurrences; b)
{ Mean(count, occurrences; b) [b*]
  Median(count, occurrences; b) [b*]
  Mode(count, occurrences; b) [b*] }

The  termination  condition  [b*]  denotes  the  existence  of  a 
value  for  b.  In  this  case  the  first  action  of  mean,  median  , 
or  mode  to  return  a  value  for  b  will  cause  the  other  two 
actions  to  terminate  immediately. 

The model/method discussed in this section has a subset which
will guarantee determinant behavior. Determinant behavior means
that given the same values for the input objects, the same values
for the output objects will result. It should be noted that there
are many situations, e.g., the above action object Averages, in
which indeterminism is useful. A general insight into the
determinant core is provided in Figure 7. If the model developed
here is used on a computing system, deadlock potential exists.
Also, the model requires that there be exactly one viable source
for each output; this requirement may be difficult to ascertain.

No objects with corporality type fluid may exist.

No internal stimulation or external termination conditions may be
used.

The number of requests in an action object must be finite.

At any level of abstraction, there can be only one viable source
for each output object resulting from a request with an external
stimulation condition.

Figure 7: General Conditions of the Determinant Core

3. Abstraction

A  fundamental  concept  in  this  research  is  that  inherent 
parallelism  in  a  problem  solution  can  be  located  by  exa¬ 
mining  the  problem  solution  at  various  levels  of  abstrac¬ 
tion.  This  section  explores  the  concepts  of  both  action 
abstraction  and  data  abstraction. 

Benjamin Whorf (19  ) has said "Language shapes the
thought  and  culture  of  those  who  use  it."  The 
model/method  described  in  Section  2  provides  an  environ¬ 
ment  or  language  that  encourages  abstraction  by  its  syn¬ 
tactic  constructs  and  structure.  Top-down  statement  of 
solutions  to  problems  is  encouraged  through  the  concept  of 
detailing  or  disaggregation  of  objects.  Bottom-up  state¬ 
ment  of  solutions  to  problems  is  encouraged  through  the 
concept  of  aggregation  or  construction  of  objects. 

Detailing  of  an  object  involves  the  replacement  of  the 
object  with  a  set  of  smaller  granularity  objects  expressing 
the  same  action  or  data,  only  in  more  detail.  Detailing  can 
continue  in  a  problem  solution  until  either  an  interface 
with  an  existing  object  occurs  or  an  interface  with  a  com¬ 
putational  device  occurs.  Aggregation  or  construction  is 
the  reverse  of  detailing.  Aggregation  is  the  process  of 
defining  a  structure  or  collection  of  one  or  more  objects. 
The  basic  operations  on  the  collection  are  defined  within 
the  aggregate  and  are  the  operations  used  when 
instantiations of the collection are manipulated.

Parallelism in problem solutions can be discovered and
described  by  examining  the  way  in  which  aggregates  or 
collections  of  data  objects  can  be  manipulated.  An  aggre¬ 
gation  of  data  which  results  in  aggregate  tokens  (or 
groups)  of  the  data  objects  which  are  computationally 
independent  represents  a  set  of  data  aggregates  which  can 
be  scheduled  in  parallel  or  concurrently.  These  aggregate 
tokens  can  be  homogeneous,  like  in  type  and  semantic 
meaning,  or  non  homogeneous.  We  will  restrict  our  discus¬ 
sion  here  to  homogeneous  aggregate  tokens. 

A  series  of  examples  from  both  office  automation  and 
numerical  linear  algebra  will  be  used  to  demonstrate  the 
concept  of  abstraction  in  a  problem  solution.  It  is  interest¬ 
ing  to  note  that  the  model/method  developed  in  Section  2 
is  equally  effective  in  both  of  these  areas. 

The payroll check calculation example (see Figure 6) is one
example of a transaction processing problem solution. Transaction
processing means that the calculation of each unit of
computation, e.g., a payroll check, is independent of the
computation of all other units. In such situations the input can
be divided by partitioning the input file in any fashion without
affecting the output results. There are consequences of the level
of aggregation. For instance, if the pay_file were divided into
1000 aggregates which form a partition, then one could cause 1000
different aggregates to be sent to other processors, while in
Figure 6 there are only 2 data object aggregates (tokens) which
can be sent, thereby reducing communication overhead.

There  is  no  need  for  the  data  aggregates  to  form  a  partition 
of  the  data  although  the  potential  for  indeterminate  com¬ 
putation  may  occur  if  the  problem  solver  creates  code  out¬ 
side  of  the  determinant  core.  For  instance,  there  could  be 
more  than  one  source  for  a  given  result  (see  the  Averages 
example). 

Consider  examples,  this  time  involving  a  two  dimensional 
collection  of  homogeneous  data  objects  which  form  a 
matrix.  In  parallel  solutions  to  numerical  linear  algebra 
problems,  the  questions  of  what  computation  can  be  done 
in  parallel  and  what  degree  of  aggregation  of  the  data 
should  be  used  arise.  The  first  question  deals  with  locat¬ 
ing  the  inherent  parallelism  in  a  problem  solution.  The 
second  question  addresses  the  issue  that  a  particular 
numerical  method  will  be  of  interest  to  people  dealing 
with  both  small  and  large  dimensional  matrices  and  the 
fact  that  one  cannot  expect  to  have  an  infinite  number  of 
processors available.

The repetitive computation of initializing each element of the
matrix to the same value can be thought of as creating aggregate
tokens which contain exactly one element of the matrix and
scheduling the entire initialization to occur concurrently. This
solution represents the maximum amount of potential parallelism.
In many situations, the dimension of the matrix will greatly
exceed the number of available processors, or this small level of
granularity will be impractical because of interprocessor
communication costs. Another solution to this problem would be to
aggregate pieces of the matrix into aggregate tokens and to
perform the initialization of the elements within each aggregate
token sequentially. The initialization of each aggregate token
can occur in parallel.
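The coarser aggregation can be sketched by grouping rows into a fixed number of aggregate tokens. This is an illustrative sketch only: grouping by contiguous row blocks, the helper names, and the thread pool standing in for separate processors are all assumptions.

```python
from concurrent.futures import ThreadPoolExecutor

def row_blocks(n_rows, n_tokens):
    """Split row indices 0..n_rows-1 into n_tokens contiguous blocks."""
    base, extra = divmod(n_rows, n_tokens)
    blocks, start = [], 0
    for t in range(n_tokens):
        size = base + (1 if t < extra else 0)
        blocks.append(range(start, start + size))
        start += size
    return blocks

def init_block(rows, n_cols, value):
    """Initialize the rows of one aggregate token sequentially."""
    return {i: [value] * n_cols for i in rows}

def initialize(n_rows, n_cols, value, n_tokens):
    """Initialize a matrix, one aggregate token per worker."""
    blocks = row_blocks(n_rows, n_tokens)
    with ThreadPoolExecutor(max_workers=n_tokens) as pool:
        parts = list(pool.map(lambda rows: init_block(rows, n_cols, value), blocks))
    merged = {}
    for part in parts:
        merged.update(part)
    return [merged[i] for i in range(n_rows)]

matrix = initialize(n_rows=7, n_cols=3, value=0.0, n_tokens=3)
block_sizes = [len(b) for b in row_blocks(7, 3)]
```

Setting `n_tokens` to the number of matrix rows, or further to one token per element, moves back toward the maximum-parallelism end of the trade-off described above.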


A  variation  on  the  previous  example  would  be  the  initiali¬ 
zation  of  a  matrix  to  the  identity  matrix.  Again  the  max¬ 
imum  degree  of  parallelism  would  occur  by  letting  each 
element  form  an  aggregate  token  and  initialize  everything 
at  once  assuring  that  the  partitions  containing  the  diagonal 
elements  were  assigned  a  one  and  everything  else  a  zero. 
An alternate aggregation of the matrix elements could consist of
forming an aggregate token that contains the diagonal elements
and one or more aggregate tokens that collect together the
off-diagonal elements of the matrix. Initialization would then
occur sequentially within each aggregate token.

The calculation of $X'X$ using the outer product is an example
where one can consider the solution at several levels of
aggregation; for simplicity we illustrate this with $X$ a
$3 \times 3$ matrix. Figure 8 illustrates the calculation based
upon data object aggregates which are rows. Figure 9 is a detail
of the outer product calculation for the first row of the matrix
$X$. Clearly, if $X$ were larger we might group the rows together
as shown in Figure 10 and then send these aggregate tokens to
different processors for potentially concurrent computation.
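The row-based outer product form of $X'X$ can be sketched in a few lines: each row contributes its own outer product, and the contributions are combined with an element-by-element matrix add. The function names `outer`, `madd` and `xtx_by_rows` are illustrative assumptions, and the 3 x 3 matrix is made up.

```python
def outer(row):
    """Outer product of a row with itself."""
    return [[a * b for b in row] for a in row]

def madd(A, B):
    """Element-by-element add for two matrices of the same shape."""
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def xtx_by_rows(X):
    """Compute X'X as the sum of the outer products of the rows of X."""
    p = len(X[0])
    total = [[0] * p for _ in range(p)]
    for row in X:
        # Each outer(row) is independent of the others and could be
        # computed on a different processor before the adds.
        total = madd(total, outer(row))
    return total

X = [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 9]]
G = xtx_by_rows(X)
```

Grouping several rows into one aggregate token amounts to calling `xtx_by_rows` on each group and combining the partial results with `madd`.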


Madd is an action that performs an element-by-element add for
matrices.

Figure 8: Outer product for X'X based on row partitions


4.  Virtual  Algorithm 

Jamieson  (1987)  proposes  that  for  any  problem  solution 
approach  there  is  a  virtual  algorithm.  She  also  proposes 
that  this  virtual  algorithm  can  be  mapped  to  one  of  a 
number  of  architecture  specific  algorithms.  Jamieson’s 
Virtual  Algorithm  Approach  is  depicted  in  Figure  11. 

The  virtual  algorithm,  for  those  problem  solutions  that 
require  no  iterative  computation,  is  defined  by  mapping  the 
inputs  directly  to  the  outputs,  recognizing  the  renaming 
and  use  of  the  inputs  in  intermediate  computations.  In 
terms  of  the  methodology  discussed  in  this  paper,  this 
means  expressing  the  problem  solution  with  the  finest 
degree  of  detail  and  complete  disaggregation  of  the  data 
objects.  Obviously  this  is  a  formidable  task  and  two  prac¬ 
tical  questions  arise.  First,  is  finding  the  virtual  algorithm 
useful? Second, since detailing is the reverse of aggregation, is
it possible to glean the useful information from the virtual
algorithm expressed at a higher level of abstraction?

If virtual algorithms could be found and expressed in a
reasonable way, all the inherent parallelism in a numerical
method could be understood and all possible sequencings of the
computations could be defined. We will use a graphical
representation of the Cholesky decomposition of a matrix to study
the usefulness of a virtual algorithm. The answer to the second
question remains open.

First we will consider the dataflow of two well known Cholesky
decomposition algorithms. The first is a traditional method given
in Figure 12a and b, and the second, given in Figure 13a and b,
was discussed by O'Leary and Stewart (1986). Neither of these
dataflow diagrams represents the virtual algorithm for this
numerical method, in that the renaming and use of the original
inputs has been ignored. Figure 14 represents the virtual
algorithm for this numerical method. Within the dataflow graph of
Figure 14, each primitive action on the original inputs is
represented at the earliest time frame (level) in which the input
for that action is available and for which the corresponding
predicates are satisfied. The diagram of the virtual algorithm
maintains the vertical positioning of actions corresponding to
time. Observe that the dataflow of the traditional algorithm of
Figure 12b is equivalent to sequencing the computation according
to horizontal planes cutting the diagram at each time frame. The
O'Leary and Stewart method of Figure 13b is also evident in the
virtual algorithm diagram. That computation proceeds in the order
given by the vertical planes shown in Figure 14.

5.  Conclusions 

This research is directed toward the discovery of inherent
parallelism (or the virtual algorithm) for a given problem
solution approach. A concurrent method/model which has
a graphical form and a linear syntactic form has been
presented which can be used as a tool for parallel algorithm
development. One advantage of the location of such
virtual algorithms is the potential of mapping these algorithms
to optimal architecture-specific algorithms.

1  Introduction  and  Summary

Solutions to fixed point problems, solutions of
equations, and maximizations often involve iterative
schemes. When each iteration consists of evaluating
a vector of functions, the possibility exists
for evaluating the coordinates of that vector asynchronously,
that is, not necessarily all at the same
time. For example, consider the following iterative
method of finding the largest eigenvalue and corresponding
eigenvector for a symmetric $m \times m$ matrix
$A$, starting with a vector $x^0$:


$$y^{j+1} = A x^j, \qquad c_{j+1} = \max_i |y_i^{j+1}|, \qquad x^{j+1} = y^{j+1} / c_{j+1}.$$


Under certain general conditions, the sequence
$c_1, c_2, \ldots$ is known to converge to the eigenvalue
of $A$ with largest absolute value and the sequence
$x^1, x^2, \ldots$ converges to a corresponding eigenvector.
Now suppose that $m$ is even and we write

$$A = \begin{pmatrix} A_0 \\ A_1 \end{pmatrix},$$

where each of $A_0$ and $A_1$ is an $(m/2) \times m$ matrix.
We can form the sequence of “partial” iterations
$x^1, x^2, \ldots$, where


Each  iteration  here  only  calculates  new  values  for  half 
of  the  vector,  keeping  the  other  half  the  same  as  the 
previous  iteration. 
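As a point of reference for the partial schemes, the full iteration described at the start of this section can be sketched in a few lines of Python. The function name `power_iteration`, the small test matrix, and the choice to normalise by the largest absolute entry are assumptions made for illustration; this is a sketch, not the authors' program.

```python
def power_iteration(A, x0, iterations):
    """Repeat y = A x, c = max_i |y_i|, x = y / c and return the last (c, x)."""
    x = list(x0)
    c = 0.0
    for _ in range(iterations):
        # Full iteration: every coordinate of y is computed from the same x.
        y = [sum(a * v for a, v in zip(row, x)) for row in A]
        c = max(abs(v) for v in y)
        x = [v / c for v in y]
    return c, x


# Hypothetical symmetric example: largest eigenvalue 3, eigenvector (1, 1).
A = [[2.0, 1.0], [1.0, 2.0]]
c, x = power_iteration(A, [1.0, 0.0], 60)
print(c, x)
```

A partial iteration would recompute only the rows of one half of `A` in the inner step and leave the remaining coordinates of `x` untouched.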


An  obvious  question  arises  as  to  what  is  to  be 
gained  by  such  a  partial  iteration  scheme.  For  ex¬ 
ample,  in  the  eigenvalue  calculation,  might  it  be  that 
it  takes  fewer  than  twice  as  many  “half  iterations” 
to  achieve  the  same  degree  of  convergence?  Could  it 
be  that  the  partial  iterations  do  not  even  converge? 
In  this  paper,  we  provide  some  theoretical  and  some 
empirical answers to questions of this sort. One situation
in which partial iterations have a great deal of
potential is in a parallel/distributed computing environment.
For example, in the eigenvalue calculation,
if one had two processors available, one could assign
all iterations involving $A_0$ to one processor and all
iterations involving $A_1$ to the other one. The sequence
of iterations would not be the same as that described
above if both processors were allowed to work at the
same time. The reason is that iterations $n$ and $n + 1$
might be proceeding simultaneously, hence iteration
$n + 1$ could not be a function of iteration $n$. If the
processors ran at different speeds, the iterations would
not even alternate between the two halves of the vector.
For this reason, such a sequence of iterations is
asynchronous.

In  Section  2,  we  give  a  precise  definition  of  asyn¬ 
chronous  iterations  and  the  types  of  problems  in 
which  they  have  been  applied.  In  Section  3,  we  give 
examples  of  some  asynchronous  iteration  schemes 
which  can  be  used  in  most  iterative  problems.  In 
Section  4,  we  present  some  theorems  giving  condi¬ 
tions  under  which  asynchronous  iterations  converge. 
In Section 5, we describe the example calculations we
performed.  These  calculations  are  all  based  on  the 
eigenvalue  problem  described  above.  In  Section  6,  we 
briefly  describe  a  video  animation  system  and  some 
videotapes  we  created  to  help  visualize  the  sequence 
of  asynchronous  iterations. 

2  Definitions  and  Notation 

Consider a mapping from a subset $D$ of $n$-dimensional
Euclidean space $\mathbb{R}^n$ to $\mathbb{R}^n$,

$$F = (F_1, \ldots, F_n) : D \to \mathbb{R}^n.$$

We will consider the problem of finding a fixed point
of this mapping by means of successive iteration. The
idea is that, since a fixed point $x$ satisfies $F(x) = x$,
then starting at an arbitrary point $x^0$, we could
successively calculate $x^j = F(x^{j-1})$. If the sequence
$\{x^j\}_{j=0}^{\infty}$ converges, it must converge to a fixed point.

Here,  we  consider  a  more  general  sequence  of  iter¬ 
ations  called  asynchronous  iterations. 

Definition  1 

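A sequence of iterations $x^0, x^1, \ldots, x^j, \ldots$ is called
a sequence of asynchronous iterations if

$$x_i^j = \begin{cases} x_i^{j-1} & \text{if } i \notin L_j, \\ F_i\bigl(x_1^{s_1^j}, \ldots, x_n^{s_n^j}\bigr) & \text{if } i \in L_j, \end{cases}$$

where $L_j$ is a nonempty subset of $\{1, \ldots, n\}$ listing
the components of $x$ updated at iteration $j$ and the
numbers $s_i^j$ are integers indicating which iterate of $x_i$
to use at iteration $j$. We require that $s_i^j \le j - 1$ so
that the procedures can be realized in practice.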
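Definition 1 translates almost directly into a driver loop. The sketch below is illustrative only: the name `asynchronous_iterations`, the 0-based component indices, and the two-dimensional test map are assumptions, not part of the paper.

```python
def asynchronous_iterations(F, x0, L, s, steps):
    """Generate x^0, x^1, ..., x^steps following Definition 1.

    F       -- list of n component functions; F[i](z) returns the new i-th value
    L(j)    -- set of (0-based) components updated at iteration j = 1, 2, ...
    s(j, i) -- index of the iterate whose i-th coordinate is read at iteration j
    """
    n = len(x0)
    history = [list(x0)]
    for j in range(1, steps + 1):
        delays = [s(j, i) for i in range(n)]
        if any(d < 0 or d > j - 1 for d in delays):
            raise ValueError("need 0 <= s(j, i) <= j - 1")
        # Assemble the delayed argument (x_1^{s_1^j}, ..., x_n^{s_n^j}).
        z = [history[delays[i]][i] for i in range(n)]
        updated = L(j)
        previous = history[-1]
        history.append([F[i](z) if i in updated else previous[i] for i in range(n)])
    return history


# Hypothetical contraction on R^2 with fixed point (2, 2).
F = [lambda z: 0.5 * z[1] + 1.0, lambda z: 0.5 * z[0] + 1.0]
# One coordinate per iteration, reading values that are up to two steps old.
history = asynchronous_iterations(
    F, [0.0, 0.0],
    L=lambda j: {(j - 1) % 2},
    s=lambda j, i: max(j - 2, 0),
    steps=200,
)
print(history[-1])
```

Keeping the whole history is wasteful but makes the role of the delay indices explicit; a practical implementation would store only the iterates that can still be read.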

For  convenience  we  define 

$$L = \{L_j \mid j = 1, 2, \ldots\} \quad\text{and}\quad S = \{(s_1^j, \ldots, s_n^j) \mid j = 1, 2, \ldots\}.$$

The types of functions $F$ we will consider are Lipschitzian
contractions. These are a special class of
Lipschitzian operators.

Definition 2 A function $F : D \to \mathbb{R}^n$ is a Lipschitzian
operator if there exists an $n \times n$ matrix $A$
with non-negative entries such that
$|F(x) - F(y)| \le A|x - y|$,
where $|\cdot|$ and $\le$ are taken component-wise.
The matrix $A$ is called the Lipschitzian matrix for
the operator $F$.

Definition 3 A function $F : D \to \mathbb{R}^n$ is a Lipschitzian
contraction if $F$ is a Lipschitzian operator
with matrix $A$ having $\rho(A) < 1$, where $\rho(\cdot)$ is the
spectral radius of its matrix argument.
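For an affine map $F(x) = Mx + b$ the entry-wise absolute value $|M|$ serves as a Lipschitzian matrix, so Definition 3 reduces to a spectral radius test. The sketch below gives a sufficient check only; it relies on the Collatz-Wielandt bound $\rho(B) \le \max_i (Bx)_i / x_i$, valid for a non-negative matrix $B$ and any positive vector $x$. The function names and test matrices are hypothetical.

```python
def spectral_radius_upper_bound(A, sweeps=200):
    """Collatz-Wielandt upper bound on rho(A) for a matrix with non-negative entries."""
    n = len(A)
    # Work with B = A + I so that the iterated vector stays strictly positive.
    B = [[A[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    x = [1.0] * n
    bound = float("inf")
    for _ in range(sweeps):
        y = [sum(B[i][j] * x[j] for j in range(n)) for i in range(n)]
        # rho(B) = rho(A) + 1, hence the subtraction of 1.
        bound = min(bound, max(y[i] / x[i] for i in range(n)) - 1.0)
        top = max(y)
        x = [v / top for v in y]
    return bound


def is_lipschitzian_contraction(M, sweeps=200):
    """Sufficient check for F(x) = M x + b, using |M| as the Lipschitzian matrix."""
    abs_M = [[abs(v) for v in row] for row in M]
    return spectral_radius_upper_bound(abs_M, sweeps) < 1.0


print(is_lipschitzian_contraction([[0.2, -0.3], [0.4, 0.1]]))
print(is_lipschitzian_contraction([[0.9, 0.9], [0.9, 0.9]]))
```

Because the routine returns an upper bound, a result below 1 certifies a contraction, while a result of 1 or more is inconclusive for matrices on which the bound has not tightened.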

3  Examples 

In  this  section,  we  present  examples  of  this  general 
class  of  iterations,  some  of  which  can  take  advantage 
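of distributed computation. Because the definition
of Lipschitzian contraction requires the operator to
be essentially linear (at least locally) we will consider
here iterations of the form

$$x_i^j = \begin{cases} x_i^{j-1} & \text{if } i \notin L_j, \\ A_i \bigl(x_1^{s_1^j}, \ldots, x_n^{s_n^j}\bigr)^T & \text{if } i \in L_j, \end{cases}$$

where $A_i$ is the $i$th row of a matrix $A$.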


3.1  Jacobi  Iterations 

Jacobi  iteration  is  a  standard  procedure  for  solving 
linear  systems  and  is  sometimes  called  the  method  of 
simultaneous  displacements.  We  describe  a  few  of  its 
variants  here. 

Sequential vector-wise evaluation: This iterative
scheme is designed to run in a single process. We use

$$L_j = \{1, \ldots, n\}, \qquad s_i^j = j - 1.$$

Here, every coordinate of $F$ is evaluated at the same
vector $x^{j-1} = (x_1^{j-1}, \ldots, x_n^{j-1})$ and then the entire
vector is updated to $x^j = (x_1^j, \ldots, x_n^j)$. Thus, $n$ components
are updated per iteration. This is the standard
method of iteration mentioned at the beginning
of Section 1.

Independent  component-wise  evaluation:  If  n  pro¬ 
cesses  are  available,  each  one  can  be  devoted  to  up¬ 
dating  a  separate  coordinate.  That  is,  we  use 

$$L_j = \{(j - 1 \bmod n) + 1\}.$$

What  happens  here  is  that  no  updated  coordinate  is 
used  in  any  iteration  until  every  coordinate  has  been 
updated.  That  is,  every  iteration  consists  of  updating 
a  single  coordinate,  and  then  every  n  iterations  all  of 
the  updated  coordinates  become  available  for  future 
iterations. Thus, one component is updated per iteration
and the processes must be “synchronized” after
every $n$ iterations; that is, the $(n+1)$st iteration cannot
proceed until the first $n$ have all been completed.

Independent  block-wise  evaluation:  If  k  processes 
are  available  and  n  =  qk,  then  each  process  can  eval¬ 
uate  q  coordinates  at  a  time.  Here  we  use 

$$L_j = \{i \mid mq + 1 \le i \le (m + 1)q\},$$

where $m = (j - 1 \bmod k)$. Instead of updating only
one coordinate per iteration, we evaluate $q$ coordinates
per iteration. But we still wait until all $n$ coordinates
get updated the same number of times before
releasing the updated values for use by the next iteration.
Thus, $q$ components are updated per iteration
and the processes must be synchronized after every
$k$ iterations.
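The three Jacobi variants differ only in the index sets $L_j$ and in the iterate that is read. The sketch below writes them as small Python functions using the 1-based coordinates of the text. The delay formulas for the two independent variants (reading the iterate released at the last synchronization point) are an assumption inferred from the verbal description, and all names are illustrative.

```python
def sequential_vectorwise(n):
    """Every coordinate is updated at each iteration from the previous iterate."""
    return (lambda j: set(range(1, n + 1))), (lambda j: j - 1)


def independent_componentwise(n):
    """One coordinate per iteration; new values are released every n iterations."""
    return (lambda j: {(j - 1) % n + 1}), (lambda j: n * ((j - 1) // n))


def independent_blockwise(k, q):
    """q coordinates per iteration; new values are released every k iterations."""
    def L(j):
        m = (j - 1) % k
        return set(range(m * q + 1, (m + 1) * q + 1))
    return L, (lambda j: k * ((j - 1) // k))


# Example: k = 2 processes, q = 3 coordinates each (n = 6).
L, s = independent_blockwise(k=2, q=3)
for j in range(1, 5):
    print(j, sorted(L(j)), s(j))
```

In each case the second function returns a single delay used for every coordinate, which suffices because none of these schemes lets the delay depend on $i$.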

3.2  Gauss-Seidel  Iterations

Gauss-Seidel iteration is another standard procedure
for solving linear systems and is sometimes
called the method of successive displacements. Gauss-Seidel
iteration is generally considered preferable to
Jacobi iteration for solving linear systems. We describe
a few of its variants here.

Sequential  component-wise  evaluation:  At  each  it¬ 
eration  a  single  component  of  F  is  updated  making 
use  of  the  most  recent  values  of  all  the  other  com¬ 
ponents.  In  this  method,  we  do  not  wait  for  every 
iteration  to  release  updated  coordinates  for  use 
by  the  next  iteration.  We  use 

$$L_j = \{(j - 1 \bmod n) + 1\}, \qquad s_i^j = j - 1.$$

That is, the coordinates are updated in sequence, one
at a time, but as soon as a coordinate is updated, it is
used in all future iterations. In the Jacobi methods,
it was always the case that, for $j$ sufficiently large,
$x^j = F(x^l)$ for some $l < j$. In general, this will not
be true for Gauss-Seidel iteration.

Sequential  block-wise  evaluation:  At  each  iteration 
a  block  of  coordinates  of  size  q  is  updated  making 
use  of  the  most  recent  values  of  all  the  components 
not in the block. At the end of each iteration the
new coordinates are released for use in calculating the
iterates for other blocks. We can use

$$L_j = \{i \mid mq + 1 \le i \le (m + 1)q\}, \qquad s_i^j = j - 1,$$

where $m = (j - 1 \bmod k)$. At iteration $j$ each coordinate
in a block is updated based on the same starting
vector $x^{j-1}$. All future iterations make use of these
updated coordinates and $q$ components are updated
at each iteration.
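The practical difference between the Jacobi and Gauss-Seidel orderings can be seen on a small linear fixed-point problem $x = Mx + b$. The sketch below is illustrative: the matrix, the tolerance, and the function names are assumptions, and one Gauss-Seidel sweep corresponds to $n$ iterations of the sequential component-wise scheme.

```python
def jacobi_sweep(M, b, x):
    """All coordinates are recomputed from the same old vector x."""
    n = len(x)
    return [sum(M[i][c] * x[c] for c in range(n)) + b[i] for i in range(n)]


def gauss_seidel_sweep(M, b, x):
    """Coordinates are recomputed in sequence; each new value is used at once."""
    x = list(x)
    for i in range(len(x)):
        x[i] = sum(M[i][c] * x[c] for c in range(len(x))) + b[i]
    return x


def sweeps_until(step, M, b, x0, tol=1e-12, limit=10_000):
    """Apply `step` until successive iterates differ by less than `tol`."""
    x, count = list(x0), 0
    while count < limit:
        new = step(M, b, x)
        count += 1
        if max(abs(u - v) for u, v in zip(new, x)) < tol:
            return count, new
        x = new
    return count, x


# Hypothetical non-negative matrix with zero diagonal and row sums below 1.
M = [[0.0, 0.4, 0.3], [0.4, 0.0, 0.3], [0.3, 0.3, 0.0]]
b = [1.0, 1.0, 1.0]
jacobi_count, x_jacobi = sweeps_until(jacobi_sweep, M, b, [0.0, 0.0, 0.0])
gs_count, x_gs = sweeps_until(gauss_seidel_sweep, M, b, [0.0, 0.0, 0.0])
print(jacobi_count, gs_count)
```

Both orderings reach the same fixed point here; the Gauss-Seidel ordering needs fewer sweeps because each updated coordinate is released immediately.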

3.3  Random  Iterations 

Randomness  can  enter  into  iterative  schemes  in  one 
or  both  of  two  ways.  Either  the  components  to  be 
updated, $L_j$, may be uncertain, or the iterates to use,
$\{s_i^j\}$, may be uncertain, or both. The reasons why
either or both of these items is uncertain may vary
from one iterative scheme to the next. We will describe
two such schemes. In each of the schemes described
below, the randomness enters solely through
uncertainty about the order of completion of the iterations.
For this reason, we introduce the set $C_j$ as the
set of indices of all iterations which have completed
at the time that iteration $j$ begins. For example, in
the sequential evaluation schemes described in Section 3.1
and Section 3.2, $C_j$ is always $\{1, \ldots, j - 1\}$.
In the independent (Jacobi) schemes, $C_j$ would be
a proper subset of $\{1, \ldots, j - 1\}$. For brevity, we
only  describe  block-wise  evaluation  schemes  in  this 


section  because  component-wise  schemes  are  special 
cases  with  one  component  per  block. 

Asynchronous fixed block-wise evaluation: If $k$ processes
are available, separate the coordinates $\{1, \ldots, n\}$ into $k$
disjoint blocks. Each process can be assigned one
of the blocks of coordinates. Each process updates
the same coordinates at each of its iterations, using
the latest available iterates of all coordinates. A process
begins a new iteration as soon as it finishes an
old one. When each block has $q$ coordinates, so that
$n = qk$, we can express this by

$$L_j = \{i \mid (j - 1)q + 1 \le i \le jq\}, \quad 1 \le j \le k,$$
$$L_j = L_J, \quad j > k,$$
$$s_i^j = \max\{k \in C_j \mid i \in L_k\},$$

where $J$ is the random iteration number of the most
recently completed iteration. Here, the block of coordinates
to be updated at iteration $j$ is uncertain
due to the fact that we do not know which of the
$k$ ongoing iterations (processes) will finish next (and
hence begin the next iteration). After each iteration
completes, the newly updated coordinates become
available for use at all future iterations.

Asynchronous cyclic block-wise evaluation: If $k$
processes are available, and $n = kq$, the coordinates
are divided into blocks of size $q$ and the blocks are
updated cyclically. Each iteration consists of updating
the next block of coordinates. Each process uses the
latest available iterates of all coordinates. A process
begins the next iteration in sequence as soon as it
finishes an old one. We can express this by

$$L_j = \{i \mid mq + 1 \le i \le (m + 1)q\}, \qquad s_i^j = \max\{k \in C_j \mid i \in L_k\},$$

where $m = (j - 1 \bmod k)$. Here, the block of coordinates
to be updated at iteration $j$ is known due to
the cyclic nature of the scheme. But which iterate of
each coordinate is to be used in the next iteration is
uncertain until we know which previous iterations have
finished. After each iteration completes, the newly
updated coordinates become available for use at all
future iterations.

Obviously,  none  of  the  block  evaluation  schemes  re¬ 
quire  that  n  =  kq.  However,  Lj  is  simpler  to  express 
when  n  =  kq. 
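The two asynchronous schemes can be imitated on a single machine with a small event-driven simulation. The sketch below is hypothetical throughout: the exponential service times, the affine map $x \mapsto Mx + b$, the 0-based indices, and the rule that a late result is discarded when a newer iteration has already updated the same block are assumptions chosen for illustration.

```python
import heapq
import random


def simulate(M, b, blocks, rates, completions, policy="fixed", seed=0):
    """Simulate k processes iterating x <- M x + b one block at a time.

    blocks -- list of k disjoint lists of (0-based) coordinates
    rates  -- rates[p] is the rate of the exponential service time of process p
    policy -- "fixed": process p always updates blocks[p];
              "cyclic": each new iteration takes the next block in the cycle
    """
    rng = random.Random(seed)
    n, k = len(b), len(blocks)
    x = [0.0] * n          # latest released value of every coordinate
    newest = [0] * k       # number of the newest iteration applied to each block
    updates = [0] * k      # how many times each block has been updated
    running = []           # heap of (finish time, process, iteration, block, values)
    next_iteration = 1

    def start(p, now):
        nonlocal next_iteration
        j = next_iteration
        next_iteration += 1
        block = p if policy == "fixed" else (j - 1) % k
        # Read the latest available iterates now; release the result on completion.
        values = [sum(M[i][c] * x[c] for c in range(n)) + b[i] for i in blocks[block]]
        heapq.heappush(running, (now + rng.expovariate(rates[p]), p, j, block, values))

    for p in range(k):
        start(p, 0.0)
    for _ in range(completions):
        now, p, j, block, values = heapq.heappop(running)
        if j > newest[block]:   # otherwise a later iteration already updated this block
            for i, v in zip(blocks[block], values):
                x[i] = v
            newest[block] = j
            updates[block] += 1
        start(p, now)
    return x, updates


# Hypothetical contraction with fixed point 1 / 0.7 in every coordinate.
M = [[0.0, 0.2, 0.1, 0.0], [0.2, 0.0, 0.0, 0.1],
     [0.1, 0.0, 0.0, 0.2], [0.0, 0.1, 0.2, 0.0]]
b = [1.0, 1.0, 1.0, 1.0]
blocks = [[0, 1], [2, 3]]
x_fixed, updates_fixed = simulate(M, b, blocks, [1.0, 3.0], 400, policy="fixed")
x_cyclic, updates_cyclic = simulate(M, b, blocks, [1.0, 3.0], 400, policy="cyclic")
print(updates_fixed, updates_cyclic)
```

With unequal rates the fixed policy updates the fast process's block more often, whereas the cyclic policy spreads the updates more evenly over the blocks.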

4  Theoretical  Results 

Several  authors  have  proven  that  asynchronous  it¬ 
erations  converge  under  certain  conditions.  These 
conditions  generally  involve  the  number  of  times  each 
coordinate is updated, and how large $s_i^j$ gets. In
Section 4.1, we present two previous results on the
convergence of asynchronous iterations which impose
deterministic criteria on the performance of the iteration
scheme. In Section 4.2, we discuss probabilistic
criteria which lead to almost sure convergence of
asynchronous iterations.

4.1  Deterministic  Results 

The  following  conditions  are  assumed  in  the  first 
two  theorems. 

1.  $\lim_{j \to \infty} s_i^j = \infty$, $\forall i$.

2.  $i \in L_j$ infinitely often.

The  first  of  these  conditions  guarantees  that  the  ul¬ 
timate  step  of  the  iteration  depends  on  penultimate 
steps  rather  than  very  old  steps.  The  second  of  these 
conditions  guarantees  that  every  component  will  be 
updated  many  times.  Chazan  and  Miranker  (1969) 
proved  a  theorem  concerning  affine  functions. 

Theorem 4 Chazan and Miranker (1969). If
$F(x) = Ax + b$ then the asynchronous iteration converges
if and only if $\rho(A) < 1$.

Baudet  (1975)  was  concerned  with  Lipschitzian  con¬ 
tractions. 

Theorem  5  Baudet  (1975).  If  F  is  a  Lipschitzian 
contraction  then  the  asynchronous  iteration  con¬ 
verges  to  the  unique  fixed  point  of  F. 

The third theorem, due to Lubachevsky and Mitra
(1986), applies only to finding the fixed point of a
matrix $A = ((a_{ij}))$ with $\rho(A) = 1$. Here $F(x) = Ax$.
In this theorem, for each $i \in L_j$, $s^j$ is allowed to
depend on $i$. That is,


Theorem 6 Lubachevsky and Mitra (1986). Suppose
$A$ is a non-negative irreducible matrix and assume
there is $i$ such that $a_{ii} > 0$, $x^0 > 0$, and
$s_i^j(i) = j - 1$ for all $j > 0$. Then the asynchronous
iteration converges to a scalar multiple of the fixed
point of $A$.

4.2  Probabilistic  Results 

There  are  two  types  of  probabilistic  results  with 
which  we  will  deal.  The  distinction  depends  on  the 
relationship  between  the  way  Lj  is  chosen  and  the 
times taken to complete the first $j - 1$ iterations. We
define the $j$th service time to be the time from the
start of the $j$th iteration until its completion. The
first type of result deals with the case in which the $L_j$
are chosen independently of the service times. The
second type of result allows the $L_j$ to depend on the
service times. Throughout this section, we assume
that the service times are finite almost surely. We will
also assume that $s_i^j = \max\{k \in C_j \mid i \in L_k\}$, so that
there is no chance of a coordinate “getting stuck” at
an old value when newer updates are available. The
only thing required in order to guarantee that the
two conditions at the beginning of Section 4.1 will hold
with probability 1 is that

$$\Pr(i \in L_j \text{ for infinitely many } j) = 1 \quad \text{for each } i = 1, \ldots, n. \tag{1}$$

We  will  consider  schemes  which  guarantee  (1)  both 
when  Lj  is  independent  of  the  service  times  and  when 
Lj  depends  on  the  service  times. 

4.2.1  Lj  Independent  of  Service  Times 

Here we will describe some schemes which are designed
to guarantee (1). The basic idea of these
schemes is to choose the $L_j$, $j = 1, 2, \ldots$, in such a way
that each coordinate has a positive probability of being
in $L_j$ for $j = k, \ldots, k + m$ for all sufficiently large
$k$ and some finite $m$, and to be sure that the probability
of each coordinate being in $L_j$ does not go to 0
as $j$ increases. In this case, the law of large numbers
will assure that each $i$ appears in infinitely many $L_j$
with probability 1. One way to arrange this would be
to choose a collection of $r$ subsets of $\{1, \ldots, n\}$, say
$M_1, \ldots, M_r$, such that

$$\{1, \ldots, n\} = \bigcup_{t=1}^{r} M_t.$$

Then let $L_j$ be a random choice from $M_1, \ldots, M_r$. If
the choices are made independently and

$$p_t = \Pr(L_j = M_t) > 0$$

for each $t$ and all $j$, then the law of large numbers
guarantees that (1) holds with probability 1.
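The independent random choice of $L_j$ is easy to simulate. The sketch below draws a schedule from a hypothetical cover $M_1, M_2, M_3$ of $\{1, \ldots, 4\}$ with assumed probabilities and counts how often each coordinate is selected; the names and numbers are illustrative only.

```python
import random


def random_schedule(subsets, probabilities, length, seed=0):
    """Draw L_1, ..., L_length independently, with Pr(L_j = M_t) = p_t."""
    rng = random.Random(seed)
    return rng.choices(subsets, weights=probabilities, k=length)


def appearance_counts(schedule, n):
    """Number of iterations in which each coordinate 1..n is updated."""
    counts = [0] * (n + 1)          # index 0 is unused
    for L in schedule:
        for i in L:
            counts[i] += 1
    return counts[1:]


M_sets = [{1, 2}, {2, 3}, {4}]      # hypothetical cover of {1, 2, 3, 4}
p = [0.5, 0.3, 0.2]
schedule = random_schedule(M_sets, p, 1000)
counts = appearance_counts(schedule, 4)
print(counts)
```

Since every $p_t$ is positive, each coordinate keeps being selected as the schedule grows, which is the behaviour condition (1) asks for.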

There is another class of schemes, which we will
call Markov schemes, which also guarantee (1). If we
let

$$p_{t,s} = \Pr(L_j = M_t \mid L_{j-1} = M_s)$$

for all $j > k$, then we can state some sufficient conditions
for (1) to hold. For example, if the transition
matrix $P = ((p_{t,s}))$ is regular (i.e. $P^m$ has all non-zero
entries for some $m$) then (1) holds because each
coordinate has some positive probability of appearing
in at least one of the next $m$ $L_j$, and the probability
does not go to 0 as $j$ increases. Also, if $P \ne I$ but
$P^m = I$ for some $m$, then (1) holds. Asynchronous
cyclic block-wise evaluation is such a scheme. It corresponds
to $r = k$ and

$$p_{t,s} = \begin{cases} 1 & \text{if } t = (s + 1 \bmod k), \\ 0 & \text{otherwise.} \end{cases}$$

In this case $P^k = I$. Many other block-wise schemes
are available among the Markov schemes, including
both deterministic and random choices of $L_j$.
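Both sufficient conditions for Markov schemes can be checked mechanically for a given transition matrix. The sketch below uses 0-based states and plain nested lists; the function names are illustrative, and the regularity test relies on the Wielandt bound, by which a primitive $r \times r$ matrix has a strictly positive power no later than the $((r - 1)^2 + 1)$th.

```python
def mat_mul(P, Q):
    n = len(P)
    return [[sum(P[i][l] * Q[l][j] for l in range(n)) for j in range(n)] for i in range(n)]


def mat_power(P, m):
    n = len(P)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(m):
        result = mat_mul(result, P)
    return result


def is_regular(P):
    """True if some power P^m, 1 <= m <= (r - 1)^2 + 1, has only positive entries."""
    r = len(P)
    Q = [row[:] for row in P]
    for _ in range((r - 1) ** 2 + 1):
        if all(v > 0 for row in Q for v in row):
            return True
        Q = mat_mul(Q, P)
    return False


def cyclic_transition_matrix(k):
    """State s is always followed by state (s + 1) mod k."""
    return [[1.0 if t == (s + 1) % k else 0.0 for t in range(k)] for s in range(k)]


P = cyclic_transition_matrix(3)
identity = [[1.0 if i == j else 0.0 for j in range(3)] for i in range(3)]
print(mat_power(P, 3) == identity, is_regular(P))
print(is_regular([[0.5, 0.5], [1.0, 0.0]]))
```

The cyclic matrix is not regular, but its $k$th power is the identity, so it falls under the second condition rather than the first.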

4.2.2  Lj  Dependent  on  Service  Times

When the $L_j$ are dependent on the service times,
various difficulties can arise. For example, a silly
algorithm for choosing $L_j$ would be, for $j > 10$, if
any of the first 10 completed service times is greater
than 14 seconds, $L_j = \{1\}$. Assuming that the service
time distribution had positive probability beyond
14 seconds, there would be positive probability that
all coordinates other than 1 would be updated only
finitely often. Rather than try to construct necessary
conditions for ruling out this type of behavior,
we propose simple sufficient conditions.

Suppose that we choose $r$ subsets of $\{1, \ldots, n\}$, say
$M_1, \ldots, M_r$, such that

$$\{1, \ldots, n\} = \bigcup_{t=1}^{r} M_t$$

and each $L_j$ is required to be one of the $M_t$. One
way in which the $L_j$ can be dependent upon the service
times is for $L_j$ to be a function of which $M_t$ was
updated in the iteration which most recently completed.

Asynchronous fixed block-wise evaluation is an example
of this type of scheme, in which $L_j$ is exactly
that $M_t$ which was updated by the iteration which
most recently completed. This requires that $r = k$
and that the first $k$ of the $L_j$ are $M_1, \ldots, M_k$ in some
order. This scheme has the property that (1) holds.
There are other such schemes for which (1) holds.
That is, suppose that $r = k$ and that the first $k$ of
the $L_j$ are $M_1, \ldots, M_k$ in some order. Let

$$f : \{1, \ldots, k\} \to \{1, \ldots, k\}$$

be a one-to-one function. For $j > k$, let $m_j$ be
the number of the iteration which completes just
before iteration $j$ begins. Then (1) will hold if
$L_j = M_{f(t)}$, where $L_{m_j} = M_t$. There are $k!$ such
schemes and asynchronous fixed block-wise evaluation
corresponds to $f(i) = i$, for $i = 1, \ldots, k$.


5  Empirical  Results 

In this section we describe the test cases which
we ran to compare the performance of synchronous
and asynchronous iterations on a parallel/distributed
system. The particular system of processors used in
the computations is described by Eddy and Schervish
(1986) and has been used in several statistical applications
(Eddy and Schervish, 1987 and Schervish,
1988). A brief description follows.

5.1  The  Distributed  System  Used 

The parallel/distributed system used in the examples
of this paper is a special case of a master-slave
system. In a master-slave system, one process acts
like a master, keeping track of control information,
such as $L$ and $S$ and which iterations are outstanding.
The slave processes perform the bulk of the numerical
calculations, such as function evaluations and matrix
multiplications. The system of Eddy and Schervish
(1986) uses the DECnet communication protocol between
VAX computers running the VMS operating
system. The master process communicates with the
slaves by writing to and reading from network devices
(DECnet's way of defining communication channels).

Data-flow is implemented by having some of the
reading and writing done asynchronously. For example,
the master begins by assigning a task to each
slave. This is done by writing the appropriate data
and/or instructions to the network device associated
with each slave. The master then reads from the network
device, but does not wait for a response. Figuratively
speaking, the master says “Let me know as
soon as something arrives.” Then the master goes on
to the next slave. When “something arrives” from a
slave, the master deals with the response and sends
another task (if any remain) in the same way as before.
On the other hand, each slave begins by reading
from the network device and waiting for a task to arrive
from the master. It then does its work, writes its
response to the network device, and waits for another
task. When the work is finished, the master can release
the slaves or keep them waiting for a brand new
set of tasks.
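The task-passing pattern described above can be imitated with threads and queues. The sketch below is only an analogy: Python threads stand in for the networked VAX processes, `queue.Queue` objects stand in for the DECnet network devices, and the task (one row of a matrix-vector product) and all names are hypothetical.

```python
import queue
import threading


def worker(worker_id, inbox, outbox):
    """Wait for a task, do the work, write the response, wait again."""
    while True:
        task = inbox.get()
        if task is None:                  # the master releases the worker
            return
        row_index, row, x = task
        outbox.put((worker_id, row_index, sum(a * v for a, v in zip(row, x))))


def master(A, x, n_workers=3):
    """Compute y = A x by handing out one row at a time."""
    outbox = queue.Queue()
    inboxes = [queue.Queue() for _ in range(n_workers)]
    threads = [threading.Thread(target=worker, args=(w, inboxes[w], outbox))
               for w in range(n_workers)]
    for t in threads:
        t.start()
    tasks = [(i, row, x) for i, row in enumerate(A)]
    tasks.reverse()                       # pop() then hands out rows in order
    outstanding = 0
    for w in range(n_workers):            # one initial task per worker
        if tasks:
            inboxes[w].put(tasks.pop())
            outstanding += 1
    y = [None] * len(A)
    while outstanding:
        w, i, value = outbox.get()        # whichever worker responds first
        y[i] = value
        outstanding -= 1
        if tasks:                         # send another task if any remain
            inboxes[w].put(tasks.pop())
            outstanding += 1
    for inbox in inboxes:
        inbox.put(None)
    for t in threads:
        t.join()
    return y


print(master([[1, 2], [3, 4], [5, 6], [7, 8]], [1, 1]))
```

A faster worker returns to the shared response queue sooner and therefore receives more tasks, which is the load-balancing effect the asynchronous reads are meant to achieve.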

5.2  The  Example  Matrix 

We used three different iterative schemes for finding
the largest eigenvalue and corresponding eigenvector
(henceforth called the largest eigenvector) of a
matrix $A$. The matrix is a $499 \times 499$ circulant with
$(.9)^{|i-j|}$ in the $(i,j)$ entry. The iterative methods we
used were based on the iterative algorithm described
in Section 1 and letting $x^0$ be any vector not orthogonal
to the largest eigenvector.

The matrix used in the example has a simple eigenstructure
which we describe here. These results follow
from the theorems of Section 6.5.2 of Anderson
(1971). Our matrix $A$ can be expressed as

$$A = I + 2 \sum_{i=1}^{249} (.9)^i A_i,$$

where $A_i$ is a matrix whose only non-zero entries
are $\tfrac{1}{2}$s on the $i$th and $(499 - i)$th sub- and super-diagonals.
That is, if $A_i = ((a_{jk}))$ then

$$a_{jk} = \begin{cases} \tfrac{1}{2} & \text{if } |j - k| = i, \\ \tfrac{1}{2} & \text{if } |j - k| = 499 - i, \\ 0 & \text{otherwise.} \end{cases}$$

Theorem 6.5.3 of Anderson (1971) says that the
eigenvalues of $A_i$ are

$$\cos\left(\frac{2\pi i k}{499}\right), \quad k = 0, \pm 1, \ldots, \pm 249.$$

That is, all but the largest one come in pairs of two
equal eigenvalues. The eigenvectors corresponding to
$\cos\left(\frac{2\pi i k}{499}\right)$, for $k > 0$, are

$$\left(\cos\frac{2\pi k j}{499}\right)_{j=1}^{499} \quad\text{and}\quad \left(\sin\frac{2\pi k j}{499}\right)_{j=1}^{499}.$$

The eigenvector corresponding to the largest eigenvalue
1 is $(1, 1, \ldots, 1)^T$. Note that all $A_i$ have the
same eigenvectors. Since $A$ is a positive linear combination
of the $A_i$, the $k$th largest eigenvalue of $A$ is
the same linear combination of the $k$th largest eigenvalues
of the $A_i$. That is, the $k$th largest eigenvalue
of $A$ is

$$1 + 2 \sum_{i=1}^{249} (.9)^i \cos\left(\frac{2\pi i k}{499}\right).$$

In particular, the first three eigenvalues are approximately:

18.99999999992728,

18.73270191995797,

and

18.73270191995797.

The first two values are fairly close together, their
ratio being approximately 0.9859. Even the fourth
and fifth eigenvalues are large, being approximately
0.9460 times as large as the first one.

5.3  The  Test  Cases

We performed asynchronous iterations in double
precision, computing both the vector $x^j$ and the approximate
eigenvalue $c_j$ until the following convergence
criterion was met:

Convergence criterion: Wait until every coordinate
has been updated at least once and
stop as soon as both of the following two
conditions are met:

•  $\ldots < 10^{-16}$,

•  $\ldots < 10^{-16}$,

where $y_i^j = x_i^j / \max\{|x_1^j|, \ldots, |x_n^j|\}$.

The first condition ensures that the approximate
eigenvalue has not changed much and the second
ensures that the product $A x^{j-1}$ is approximately
$c_j x^{j-1}$.

The test cases described here are the same cases
used in the videotape; however, they are not the same
runs described there. The reason is that there is a
significant  amount  of  time  required,  during  the  run, 
to  write  the  information  used  in  the  video  tape.  The 
more  processors  used  in  a  run,  the  more  iterations 
were  done,  and  the  more  writing  that  was  done. 
The  timings  would  not  be  indicative  of  the  savings 
achieved  by  multiple  processors  if  we  timed  the  writ¬ 
ing  of  the  videotape  information.  Because  the  runs 
are  not  the  same  and  the  environment  is  stochastic, 
the  numbers  of  iterations  will  be  different  also. 
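The eigenvalues quoted for the example matrix in Section 5.2 can be cross-checked numerically. The sketch below assumes the standard closed form $1 + 2\sum_{i=1}^{249} (.9)^i \cos(2\pi i k / 499)$ for the eigenvalues of a $499 \times 499$ symmetric circulant matrix whose entries are $(.9)^d$, with $d$ the circular distance between row and column; the function and parameter names are illustrative.

```python
import math


def eigenvalue(k, n=499, rho=0.9):
    """Eigenvalue number k (k = 0 is the largest) of the circulant example matrix."""
    half = (n - 1) // 2
    return 1.0 + 2.0 * sum(rho ** i * math.cos(2.0 * math.pi * i * k / n)
                           for i in range(1, half + 1))


first, second, fourth = eigenvalue(0), eigenvalue(1), eigenvalue(2)
print(first, second)
print(second / first, fourth / first)
```

Because the eigenvalues for $k \ge 1$ come in equal pairs, `eigenvalue(1)` gives both the second and third values and `eigenvalue(2)` both the fourth and fifth. The ratio of the second to the first governs how quickly the synchronous iteration separates the largest eigenvector from its nearest competitors.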

5.3.1  Synchronous  Computation 

We used a sequential vector-wise Jacobi scheme
starting with $x^0$ being a vector of numbers between
0 and 1, each chosen by a uniform pseudo-random
number generator. The convergence criterion was
met after 2332 iterations and 8hr 39min of wall-clock
time on a single VAXstation 2000 dedicated to the
task.

5.3.2  Asynchronous  Computation 

For the asynchronous computations, we divided
the vector into nearly equal subvectors and updated
one subvector per iteration. We used both a fixed
allocation and a cyclic allocation. In the fixed allocation,
one processor was devoted to each block of
coordinates, while in the cyclic allocation, whenever
a processor required another block, it was assigned
whichever block was next in the cycle.

Fixed Allocation We used an asynchronous fixed
block-wise scheme with $k = 5$ blocks, 4 of size 100
and one of size 99. After 2hrs 15min and 23155 iterations,
the convergence criterion was met. The five
processors were not identical. The third one was a
VAXstation 3200 and the other four were VAXstation
2000s. The fifth processor was busy with other
work (unrelated to our calculations) to a greater extent
than the other four. The numbers of iterations
performed by each of the five processors were

3753  3903  9413  3782  2304.

Notice that the smallest number of iterations is about
the same as the number of iterations required by the
Jacobi vector-wise allocation. The starting vector $x^0$
for this calculation was a unit vector with 1 in the
first coordinate and 0 elsewhere.

Cyclic Allocation We used an asynchronous
cyclic block-wise scheme with $k = 10$ blocks, 9 of
size 50 and one of size 49. The starting vector $x^0$
for this calculation was the same uniform pseudo-random
vector used in the Jacobi scheme. After 1hr
9min and 28818 iterations, the convergence criterion
was met. The 10 processors were all VAXstation
2000s or VAXstation IIs and the numbers of iterations
performed on each of the 10 blocks were

2884  28T  2890  2888  2881 

2890  2881  2888  2877  2865. 

Since the 10 blocks were assigned to iterations
cyclically, we did not keep track of how many iterations
were performed by each processor, but rather
how many iterations updated the coordinates in each
block. Notice that the 10 blocks were all updated
approximately the same number of times when convergence
occurred. There are two reasons why the
numbers are not all equal. Most obvious is that there
are still iterations ongoing when the convergence criterion
is met. This, however, would not account for a
difference of 35 iterations between two blocks. Such
a difference is due to the nature of the asynchronous
updating. Suppose a process finishes iteration $k$ and
begins iteration $j$. If this processor was particularly
slow on this iteration, it may be that, for $i \in L_k$,
$k < s_i^j$. That is, some other processor updated the
coordinates in block $L_k$ before the one which just finished,
and the iteration which just finished must be
ignored (otherwise it might be a “downdating” rather
than an updating).
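The wall-clock figures reported in this section can be turned into rough speedup ratios. The arithmetic below uses only numbers quoted in the text; the variable names are illustrative, and the ratios are indicative at best because the processors were not identical and the fixed-allocation run started from a different vector.

```python
sequential_minutes = 8 * 60 + 39      # 8hr 39min on a single processor
fixed_minutes = 2 * 60 + 15           # 2hr 15min with five processors
cyclic_minutes = 1 * 60 + 9           # 1hr 9min with ten processors

fixed_counts = [3753, 3903, 9413, 3782, 2304]   # iterations per processor, fixed allocation

speedup_fixed = sequential_minutes / fixed_minutes
speedup_cyclic = sequential_minutes / cyclic_minutes

print(sum(fixed_counts))                        # total iterations, fixed allocation
print(round(speedup_fixed, 2), round(speedup_fixed / 5, 2))
print(round(speedup_cyclic, 2), round(speedup_cyclic / 10, 2))
```

The per-processor counts add up to the reported total of 23155 iterations, and the second number on each of the last two lines is the speedup divided by the number of processors.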

6  Animated  Videotape 

In  the  videotape  we  display  the  sequence  of  itera¬ 
tions  for  three  different  iterative  schemes  for  finding 
the  largest  eigenvalue  and  eigenvector  of  a  matrix  A 
described  earlier. 

6.1  The  Video  System 

A  512  X  512  X  8  pixel  video  frame  buffer  is  in¬ 
stalled  in  a  VAX  workstation  to  generate  an  RGB 
video  signal  under  program  control.  The  signal  is 
translated  by  an  encoder  to  NTSC  video;  NTSC  is 
the  United  States  standard  for  home  television.  The 
NTSC video signal is recorded on a 3/4 inch U-matic
VCR. This VCR has the capability to edit single
frames of video onto the tape under the direction
of a controller in an IBM PC/XT which follows commands
generated by a program running on the VAX.

The  crucial  point  in  the  application  of  thh  system 
to  the  generation  of  video  tapes  is  that  the  computa¬ 
tions  involved  in  generating  a  video  image  are  quite 
separate  from  the  actual  recording  of  the  video  tape. 
A  typical  recording  cycle  requires  about  ten  seconds 
to  record  a  single  video  frame  because  of  the  time  re¬ 
quired  to  position  the  tape  in  the  VCR.  On  the  one 
hand  this  means  that  it  takes  a  long  time  to  generate 
a  video  tape  (about  5  hours  per  minute  of  completed 
tape).  On  the  other  hand  it  makes  a  clear  separa¬ 
tion  between  the  calculations  needed  to  generate  the 
image and the actual event of recording it. This
allows fairly massive computations to be involved in
the  generation  of  the  images  without  imposing  the 
visual  time-lag  in  viewing  the  resulting  pictures. 
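
The figure of about 5 hours per minute of tape follows from the ten-second recording cycle. The sketch below reproduces that arithmetic; the nominal NTSC rate of 30 frames per second is an assumption supplied here, and the function name is hypothetical.

```python
def recording_hours(tape_minutes, frames_per_second=30, seconds_per_frame=10):
    """Hours needed to record a tape one frame at a time.

    frames_per_second is the nominal NTSC frame rate (assumed here);
    seconds_per_frame is the recording cycle quoted in the text.
    """
    frames = tape_minutes * 60 * frames_per_second
    return frames * seconds_per_frame / 3600.0


# One minute of finished tape: 1800 frames at 10 seconds each.
hours_for_one_minute = recording_hours(1)
```

With these values one minute of tape needs 1800 frames, or 18000 seconds of recording, which is 5 hours.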

6.2 Description of the Animation

Figure 1 exhibits output from a laser printer which
shows what a single frame of the video tape looks like.
This single frame illustrates the values of the
components of a particular iterate. There is a bitmap
which is 512 x 256 pixels. Each of the 512 columns
is used to display a number. The 256 rows are
divided into 64 groups of four pixels each. Each of the
64 groups is used to display the value of a single bit
of the number in that column; all four pixels within
the group have the same color. A double precision
floating point number lives in 64 bits. On the VAX
where this was done eight of the 64 bits are reserved
for exponent and are ignored. The remaining 56 bits
(of the fraction) are displayed with the most
significant bit at the bottom and the least significant bit
at the top. Figure 1 is from a fixed-blockwise
evaluation, and the five shaded regions correspond to the
five blocks and the five respective processors.

Figure 1: One Iteration From Eigenvalue Calculation, Cyclic Assignment

In  order  to  understand  this  more  precisely,  look  at 
the  last  13  columns  on  the  right  of  the  bitmap.  Using 
white for 1 and black for zero these thirteen columns
display  the  value 

1042.99999999992728 

The  56  bits  in  the  floating  point  representation  of 
this  number  are 

1000001001011111111111111111

1111101101111111110000000000

In the videotape one should read bottom-to-top and
white stands for 1. The reason the last 10 bits are
zeros is that we added 1024 to the value 18.99999...
to guarantee that all numbers had the same exponent.
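
The column encoding described above can be sketched as follows. This is an illustration only, not the program used to make the videotape: it runs on IEEE 754 doubles (53 significant bits) rather than the VAX format, so the last three of the 56 bits it produces are always zero, and the function names are hypothetical.

```python
def significand_bits(x, width=56, integer_bits=11):
    """Return the leading `width` bits of x as a string, most significant first.

    Assumes 2**(integer_bits - 1) <= x < 2**integer_bits, so that every
    value shares one exponent and the first bit is always 1.
    """
    low, high = 2.0 ** (integer_bits - 1), 2.0 ** integer_bits
    if not (low <= x < high):
        raise ValueError("x must lie in [%g, %g)" % (low, high))
    # Scaling by a power of two is exact in binary floating point.
    scaled = int(x * 2.0 ** (width - integer_bits))
    return format(scaled, "0%db" % width)


def column_pixels(bits, pixels_per_bit=4):
    """Map a bit string to one bitmap column, listed from top row to bottom row."""
    column = []
    for bit in reversed(bits):  # least significant bit goes at the top
        column.extend([int(bit)] * pixels_per_bit)
    return column


shifted = 18.0 + 1024.0  # adding 1024 forces a common exponent, as in the text
bits = significand_bits(shifted)
column = column_pixels(bits)  # 1 stands for white, 0 for black
```

Adding 1024 places every value in the interval from 1024 to 2048, which is why all columns can be compared bit by bit without displaying the exponent.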

Before viewing the actual video tape one anticipates
(because of the standard theory of convergence
for this calculation) seeing a “wave” of convergence
sweep from bottom to top of the bitmap as the
sequence of iterates converges to the solution. The
video tape exhibits exactly this convergence. However,
and this is what is important about the example,
one also notices a number of additional features.
First, there is a “cusp” in the convergence; in Figure
1 this cusp appears approximately 50 columns (10
percent of the bitmap) from the left edge. There is
also an “anticusp” approximately 300 columns (60
percent of the bitmap) from the left edge. Standard
theory does not adequately explain the presence of
these features although they are clearly related to the
eigenvectors associated with the second largest (and
smaller) eigenvalues. Second, there is an additional
effect which is also visible in Figure 1 but is more
pronounced in the animated sequence. Approximately
four bits above the “zone of convergence” there is a
very high frequency band of alternating bits. The
“buzzing” of this band is very distinctive visually in
the video tape and has no explanation known to us.
ABSTRACT

We discuss results relevant to a class of neural
networks that have a close relationship to
existing techniques in applied statistics such
as density estimation, CART and projection
pursuit. The perspective of this presentation
is that of approximation theory. We
indicate how some statistical methods might be
used  to  shed  light  on  the  behavior  of  neural 
networks. 

1.  Introduction 

Neural  computing  is  a  general  approach  to 
computation  that  strives  to  use  networks  of 
simple  processing  elements  instead  of  tradi¬ 
tional  procedural  algorithms  to  implement  a 
desired  functional  input/output  relationship. 
Although  the  foundations  of  neural  computa¬ 
tion  go  back  over  thirty  years,  there  was  a 
long  period  in  the  1970's  during  which  inter¬ 
est  in  the  technology  dwindled  partly  because 
of  some  mathematically  demonstrable  limita¬ 
tions  described  by  Minsky  and  Papert  in 
[Mi69]. By the 1980's, researchers became
confident that some of the limitations
described by Minsky and Papert might be
circumvented by making the underlying neural
networks more complex (see for example
[PDP, Ho82, Ho85]).

There  are  two  major  ways  in  which  networks 
have  been  embellished  to  make  them  more 
powerful.  One  involves  the  introduction  of 
feedback  or  stochastic  mechanisms  into  the 
networks  thereby  making  them  dynamical 
systems  capable  of  more  complex  behavior. 
The other, on which we focus in this paper,
is the use of multilayered networks with
theoretically unbounded order in the sense of
[Mi69]. A fundamental advance with respect
to multilayered networks has been the
discovery of training algorithms that have
worked well empirically in many applications
[PDP].

This research was partially supported by Office
of Naval Research grant N00014-87-K-0182
and National Science Foundation grant DCR-8619103.

This paper addresses a number of problems
related to multilayered, feedforward,
continuous (MFC) networks. We emphasize this
restriction because many different ideas fall
under the general rubric of neural network
theory and, while we acknowledge the
existence of other technologies (such as
networks with feedback, various associative
memories, Hopfield-Tank optimization
networks, Boltzmann machines, etc.), we
cannot pretend to deal with them all. Moreover,
we believe the class of MFC networks to be
the most promising for time series and other
statistical applications. Indeed, the classical
work of Widrow on adaptive filtering [Wi62]
is perhaps the simplest manifestation of
feedforward networks applied to statistical
filtering problems. Needless to say, those ideas
have proven to be extremely useful in
applications such as channel equalization and echo
cancelling in real-time telecommunications
settings [Wi85]. More recently, there have
been some interesting empirical studies done
in nonlinear time series prediction that
indicate some potential utility of neural networks
in such an application [La87, Mo88].

We  discuss  some  practical  issues  surround¬ 
ing  multilayered,  feedforward,  continuous 
networks  especially  in  the  context  of  known 
statistical  techniques.  The  first  question  to  be 
discussed  concerns  identifying  the  class  of 
problems  that  can  in  principle  be  solved  by 
MFC  networks.  On  that  point,  we  have  ob¬ 
tained  general  results  demonstrating  that,  at 


least  theoretically,  networks  with  single  in¬ 
ternal  hidden  layers  can  be  used  to  solve  any 
continuous  approximation  problem  [Cy88a, 
Cy88b].  Next,  we  discuss  the  class  of  prob¬ 
lems  that  are  feasibly  (as  opposed  to  theoreti¬ 
cally)  solvable  by  MFC  networks.  Finally, 
we  discuss  procedures  for  determining 
whether  a  candidate  problem  (as  presented  by 
empirical  input/output  data)  might  be  feasibly 
solved  by  MFC  networks. 

In  an  area  that  is  both  promising  and  contro¬ 
versial,  it  is  perhaps  important  to  outline  our 
perspective  and  philosophy  on  this  general 
area  of  research.  Our  primary  interest  has 
been  and  continues  to  be  the  investigation  of 
numerical  algorithms  for  signal  processing. 
We  believe  that  MFC  networks  offer  an 
interesting  and  potentially  powerful  tech¬ 
nology  for  solving  certain  signal  processing 
problems.  At  the  same  time,  there  are  a 
number  of  known  statistical  techniques  that 
share  many  basic  ideas  with  MFC  neural 
networks - namely, density estimation,
CART and projection pursuit methods. We
will  attempt  to  bring  some  of  these  connec¬ 
tions  to  light. 

2.  Technical  Background 

The  neural  networks  of  interest  to  us  are 
multilayered  feedforward  continuous  (MFC) 
networks.  In  order  to  discuss  such  networks, 
we  introduce  the  notion  of  an  N-node. 


An  N-node  is  a  simple  computational  unit  that 
accepts  some  number  of  real-valued  inputs, 
applies  an  affine  transformation  to  the  inputs 
and  then  applies  some  fixed  nonlinear  func¬ 
tion  to  this  affine  transformation.  The  output 
of  an  N-node  is  the  output  of  the  nonlinear 
function.  In  the  sequel,  we  assume  for  sim¬ 
plicity  that  the  nonlinearity  is  fixed  for  all 
nodes  but  that  the  affine  transformations  are 
of  course  node  dependent.  (The  use  of  the 
same  nonlinearity  is  arguably  the  most  inter¬ 
esting  case  from  an  implementation  point  of 
view  since  then  all  nonlinear  components  are 
identical.) 

Figure 1 graphically illustrates an N-node
while the simple function that an N-node
implements is given by

$$\sigma\left(\sum_{i=1}^{m} y_i x_i + \theta\right)$$

Here $X = (x_1, x_2, \ldots, x_m)$ are the real
valued inputs to the node, $Y = (y_1, y_2, \ldots, y_m)$
are real valued constant weights, $\theta$ is a
real constant and $\sigma$ is some univariate
function. The quantities $y_1, y_2, \ldots, y_m$ and $\theta$
determine the affine transformation at the
node. An MFC network is built from such
simple N-nodes by composition in layers.


Figure 1. Input-output relation of a single neural node

Figure 2. A sample network with two hidden layers


Figure 2 depicts a two layered MFC network.
Generalizations to networks with more layers
are done in the obvious manner. Without
explicitly writing out the functional form of the
network output, let us simply express the
output as

$N(X) = N(X, \Theta)$

where $\Theta$ is a large dimensioned vector of
the parameters in the network. These
parameters include all weights and thresholds.
Viewed as such, an MFC network
implements one function from a family of
functions parameterized by $\Theta$.
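
The composition of N-nodes into layers can be sketched in a few lines. This is a minimal illustration under stated assumptions: the logistic function stands in for the unspecified sigmoidal nonlinearity, every node (including the output) is an N-node, and the nested list `params` plays the role of the parameter vector. All names and numerical values are hypothetical.

```python
import math


def sigmoid(t):
    """Logistic function, one common choice of sigmoidal nonlinearity (assumed)."""
    return 1.0 / (1.0 + math.exp(-t))


def n_node(inputs, weights, theta, sigma=sigmoid):
    """One N-node: an affine map of the inputs followed by the nonlinearity."""
    return sigma(sum(y * x for y, x in zip(weights, inputs)) + theta)


def mfc_network(x, params, sigma=sigmoid):
    """Evaluate a layered feedforward network.

    params is a list of layers; each layer is a list of (weights, theta)
    pairs, one pair per node. Each layer's outputs feed the next layer.
    """
    values = list(x)
    for layer in params:
        values = [n_node(values, weights, theta, sigma) for weights, theta in layer]
    return values


params = [
    [([1.0, -1.0], 0.0), ([0.5, 0.5], -0.5)],  # first hidden layer, two nodes
    [([1.0, 1.0], -1.0), ([-1.0, 2.0], 0.0)],  # second hidden layer, two nodes
    [([1.0, -1.0], 0.0)],                      # single output node
]
output = mfc_network([0.0, 1.0], params)
```

Changing any weight or threshold in `params` selects a different member of the family of functions that the architecture can implement.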

Now  suppose  that  some  system  produces 
samples  of  input/output  data  of  the  form 

$\{(X_i, f(X_i)),\ 1 \le i \le M\}$


Here  f  is  the  real-valued  response  function  of 
the  system  -  for  input  vector  X,  the  system 
output  is  f(X).  Based  on  these  observations, 
an  MFC  network  is  sought  that  approxi¬ 
mately  interpolates  the  data  and  hopefully  ex¬ 
trapolates  to  be  a  good  approximation  of  f 
over  the  whole  input  domain  of  the  system. 

Thus we seek to find the parameters $\Theta$ that
minimize some error criterion where the error
is taken to be the difference between the
actual system output and the network output.

Algorithms for adapting $\Theta$ to attempt to
minimize this error criterion are called
supervised training, learning, etc. algorithms.
Viewing the situation from the perspective of
nonlinear optimization, most of these learning
algorithms are gradient descent methods
whereby some estimate of the gradient of the
error function (gradient with respect to the
parameters $\Theta$) is used to update and improve
an estimate for $\Theta$ [Pa87].
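
As a hedged illustration of such a gradient descent method, the sketch below fits a single N-node to samples by minimizing the summed squared error. It is not the algorithm of [Pa87]; the logistic nonlinearity, the step size, the number of steps and the synthetic data are all assumptions chosen only to keep the example small.

```python
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def node_output(x, weights, theta):
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + theta)


def squared_error(data, weights, theta):
    """Sum of squared differences between system output and node output."""
    return sum((node_output(x, weights, theta) - fx) ** 2 for x, fx in data)


def gradient_step(data, weights, theta, rate):
    """One exact-gradient update of the weights and the threshold."""
    grad_w = [0.0] * len(weights)
    grad_t = 0.0
    for x, fx in data:
        out = node_output(x, weights, theta)
        # Chain rule: d(error)/d(affine part) = 2 (out - f) * out * (1 - out).
        common = 2.0 * (out - fx) * out * (1.0 - out)
        for i, xi in enumerate(x):
            grad_w[i] += common * xi
        grad_t += common
    new_weights = [w - rate * g for w, g in zip(weights, grad_w)]
    return new_weights, theta - rate * grad_t


def train(data, weights, theta, rate=0.2, steps=1000):
    for _ in range(steps):
        weights, theta = gradient_step(data, weights, theta, rate)
    return weights, theta


# Synthetic system (assumed): f(x) = sigmoid(2x - 1), sampled on a grid.
data = [([k / 2.0], sigmoid(2.0 * (k / 2.0) - 1.0)) for k in range(-4, 5)]
initial_error = squared_error(data, [0.0], 0.0)
weights, theta = train(data, [0.0], 0.0)
final_error = squared_error(data, weights, theta)
```

Here the data are exactly representable by the node, so the error can be driven toward zero; for a multilayered network the same update applies to every weight and threshold, with the gradient obtained through the layers by the chain rule.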
Such empirical parametric model fitting is of
course  the  essence  of  much  of  applied  statis¬ 
tics  and  approximation  theory  and  by  no 
means  a  revolutionary  idea  in  its  own  right. 

In  fact,  scientists  and  engineers  have  for 
centuries  used  parametric  models  such  as 
polynomials,  splines,  rational  functions, 
Fourier  series,  exponentials  and  so  on  to  in¬ 
terpolate  and  extrapolate  empirical  data. 

What  then  is  the  novelty  of  neural  network 
theory?  From  the  point  of  view  of  MFC  net¬ 
works,  we  believe  that  the  novelty  lies  pri¬ 
marily  in  two  quite  different  directions.  First 
of  all,  the  kinds  of  parametric  models  being 
used  in  neural  network  theory  typically  in¬ 
volve  sigmoidal  functions  quite  different  and 
primitive in comparison with traditional
algebraic or transcendental functions. The
implications of using combinations and
compositions of such primitive functions for
approximation are not yet clear although
sigmoidal functions that are normally used have
certain  locality  properties  that  suggest 
robustness.  Secondly,  there  is  a 
preponderance  of  case  studies  and  examples 
illustrating  that  MFC  networks  work 
reasonably  well  across  an  array  of  seemingly 
different  applications.  This  is  not  to  say  that 
the  MFC  network  approach  is  the  best 
approach  among  many,  just  that  it  works 
quite  often.  In  this  respect,  there  is  a  certain 
similarity between MFC networks and
simulated annealing [H?i85] - they both seem to
be  reasonably  good  at  solving  many  different 
types  of  problems  but  for  any  given  problem 
there  may  well  be  a  better  way  to  solve  that 
problem.  This  fact  alone  begs  for  a  better 
explanation. 

There  is  of  course  another  important  innova¬ 
tion  in  that  at  some  level  of  abstraction,  MFC 
networks  are  biologically  meaningful  models 
of  intelligent  behavior  and  their  study  sheds 
light  on  the  neurophysiological  foundation  of 
intelligence. 

3.  Theoretical  Capabilities 

For  a  given  choice  of  network  parameters,  an 
MFC  network  implements  a  continuous 
function.  Without  constraining  the  architec¬ 
ture  or  size  of  the  network,  what  kinds  of 
functions  can  be  arbitrarily  well  approximated 


by  the  output  of  a  neural  network?  This  of 
course  depends  heavily  on  the  class  of  net¬ 
work  architectures  being  considered  and  the 
type  of  nonlinearity  implemented  by  a  single 
node. 

In prior research, we have definitively
answered a number of these questions in a
rigorous manner. Define a class of network
architectures to be complete if: given a
continuous function, f, with compact support and an
$\epsilon > 0$, there is a network from that class
whose output approximates f uniformly to
within $\epsilon$ over the support of f. For example,
there are many classical classes of functions
that are complete - polynomials,
multinomials, Fourier series and so on.

We  have  shown  that  the  following  classes  of 
networks  are  complete  in  this  sense: 

1.  networks  with  two  hidden,  internal  lay¬ 
ers  and  any  continuous  sigmoidal  nonlin¬ 
earity  [Cy88a]; 

2.  networks  with  a  single  internal  hidden 
layer  and  any  continuous  radial  basis 
type  function  (see  [Bu88,Ca87, 
Mo88,Po87]  for  discussions  of  radial 
basis  functions  -  one  can  think  of  them  as 
generalizations  of  spherically  symmetric 
Gaussian  densities  in  density  estimation 
problems)  as  a  nonlinearity  [Cy88a]; 

3.  networks  with  a  single  internal  hidden 
layer  and  any  continuous  sigmoidal  non¬ 
linearity  [Cy88b]. 

These  results  make  absolutely  no  claims 
about  the  number  of  nodes  needed  to  perform 
the  approximation  although  in  some  cases, 
gross  and  probably  unrealistic  upper  bounds 
could  be  obtained. 

Of these three results, the last concerning
networks with only one internal, hidden layer
is certainly most surprising. It has generally
been felt that such networks could implement
decision functions for convex regions and
there have been examples of special nonconvex
regions being discriminated as well
[Li87,Ni65,Wi87] but a general result has
been missing. We believe that the results of
[Cy88b] are definitive in their resolution of
the issue.

The proofs of 1. and 2. above are
constructive and basically reduce to showing that
networks in that class can implement
so-called approximations to the identity or
Parzen windows [Pa62,Du73] together with
sums of such functions. It is well known
that the convolution of a continuous function
with approximations to the identity converges
to that function uniformly over a compact
domain. What remains is to show that the
convolution integrals can be uniformly
approximated by finite Riemann sums over the
whole domain. By contrast, the proof of 3. is
nonconstructive, using the Hahn-Banach and
Riesz Representation Theorems to show that
a certain linear subspace is dense in the space
of all continuous functions.
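
The constructive argument can be illustrated numerically in one dimension. The sketch below is an assumption-laden example, not the construction of [Cy88a]: it uses a Gaussian Parzen window of width h as the approximation to the identity, replaces the convolution integral by a Riemann sum over equally spaced centers, and applies it to a test function chosen only for illustration.

```python
import math


def gaussian_kernel(t, h):
    """Gaussian Parzen window of width h; it integrates to one."""
    return math.exp(-0.5 * (t / h) ** 2) / (h * math.sqrt(2.0 * math.pi))


def riemann_sum_approximation(f, centers, spacing, h):
    """Return g(x) = sum_k f(c_k) K_h(x - c_k) * spacing.

    Each term is one kernel-type node centered at c_k with output weight
    f(c_k) * spacing, so g is a finite sum of such nodes.
    """
    weights = [f(c) * spacing for c in centers]

    def g(x):
        return sum(w * gaussian_kernel(x - c, h) for w, c in zip(weights, centers))

    return g


def f(x):
    """Continuous test function supported on the unit interval (assumed)."""
    return x * (1.0 - x) if 0.0 <= x <= 1.0 else 0.0


spacing = 0.002
centers = [k * spacing for k in range(501)]  # grid covering [0, 1]
g = riemann_sum_approximation(f, centers, spacing, h=0.02)
error_at_half = abs(g(0.5) - f(0.5))
```

Shrinking h while keeping the grid spacing well below h tightens the approximation, at the cost of more nodes, which is consistent with the remark that the results say nothing useful about the number of nodes required.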

In summary, we feel that these results give
rigorous meaning to the assertion that in
principle any continuous function can be
approximated by any of the three classes
discussed above. Extensions to discontinuous,
integrable functions are outlined in [Cy88b]
as well.

Given  that  the  classes  of  networks  described 
above  share  the  same  completeness  properties 
as  many  classical  classes  of  functions 
(splines,  polynomials,  Fourier  series,  expo¬ 
nential  families),  what,  if  any,  properties  of 
MFC  networks  make  them  distinct?  As  we 
have  mentioned  before,  in  cases  1.  and  2. 
above,  the  networks  are  capable  of  imple¬ 
menting  Parzen  window  type  estimators  and 
hence  there  is  a  certain  localization  property 
that  such  approximations  have.  In  a  noisy 
approximation  problem,  this  might  be  inter¬ 
pretable  in  terms  of  robustness.  Secondly, 
the  strong  biological  motivation  makes  the 
study  of  these  types  of  approximating  fami¬ 
lies  interesting  from  a  purely  intellectual  point 
of  view  -  if  indeed  nature  implements  pattern 
recognition  and  classification  this  way  using 
neurons,  then  it  is  interesting  to  understand 
how  that  is  done. 

4.  Feasibility 

The  results  summarized  in  the  previous  sub¬ 
section  indicate  that  network  architectures  and 


the  nonlinearities  that  they  implement  do  not 
constrain  the  kinds  of  problems  that  can  be 
handled  by  MFC  networks.  However,  in  any 
real  engineering  attempt  to  implement  a  net¬ 
work  solution,  constraints  must  be  imposed 
on  the  number  of  nodes  used,  the  amount  of 
data  that  can  be  observed  and  the  complexity 
of  the  algorithm  used  to  find  suitable  network 
parameters. 

There have been numerous recent efforts
trying to deal with such issues for a variety
of different settings [Ah88,Ba88,Bl87,
Ke87,Va84]. Valiant formalized a notion of
feasibility with respect to learning a boolean
function and demonstrated that certain classes
of boolean functions were feasibly learnable
in that sense [Va84]. (It should be clarified
that in the context of our prior discussion,
learning is any technique for selecting model
parameters that let the parameterized system
duplicate or approximate the input/output
behavior of the observed system.) Valiant
introduces a probabilistic setting for learning
that is reminiscent of classical hypothesis
testing.

Blumer  et  al.  generalized  Valiant's  ideas  to 
more  general  notions  of  learning  (for  exam¬ 
ple,  learning  rectangles  or  convex  sets)  and 
related  feasibility  in  learning  to  the  concept  of 
Vapnik-Chervonenkis dimension in a
nontrivial manner [Bl87]. Vapnik-Chervonenkis
dimension  was  an  idea  introduced  in  non- 
parametric,  distribution  free  pattern  recogni¬ 
tion  some  time  ago  [Va71]  and  its  interpreta¬ 
tion  and  utility  in  the  context  of  learning  is 
therefore  quite  natural  although  not  at  all  ob¬ 
vious.  Baum  and  Haussler  have  recently  ap¬ 
plied  those  results  to  neural  networks  with 
hard  limiting  nonlinearities  by  estimating  the 
Vapnik-Chervonenkis  dimension  of  a  simple 
class  of  neural  networks  [Ba88].  However, 
the  results  of  [Ba88]  are  disappointing  from  a 
practical  point  of  view  since  the  results  make 
statements  about  the  extent  to  which  neural 
networks  can  accurately  generalize  assuming 
that  some  fraction  of  the  empirical  data  pre¬ 
sented  to  the  network  can  be  correctly 
learned,  without  directly  addressing  the  diffi¬ 
cult  question  of  what  sets  of  data  can  be 
learned by such (finite) networks. Recent
work by Judd and Rivest [Ju88,Ri88]
demonstrates that this is indeed a difficult
question by showing that the problem of
determining whether a given network
architecture can exactly implement a given empirical
data set is in general NP-complete.

All  of  the  research  discussed  above  deals 
with  Boolean  (0,1)  valued  systems  such  as 
Boolean  expressions  and  characteristic  func¬ 
tions  of  sets.  The  situation  with  respect  to 
real-valued  functions  and  real-valued  net¬ 
works  is  largely  uncharted  territory.  There 
has  been  some  recent  theoretical  analysis  of 
so-called  universal  Donsker  classes 
[Du84,Du87] that generalize Vapnik-
Chervonenkis  classes  in  the  context  of  dis¬ 
tribution  free  limit  theorems  but  even  then,  it 
appears  that  most  interesting  examples  are 
closely  related  to  the  idea  of  Vapnik-Chervo- 
nenkis  dimension  anyway.  There  are  some 
intriguing  relationships  between  Donsker 
classes  and  metric  entropy  [Du87]  that  might 
be  interpretable  in  terms  of  signal  bandwidth 
-  we  discuss  this  shortly.  Accordingly,  most 
of the work on real-valued networks has been
empirical  (such  as  [La87,Mo88]  and  many 
papers  in  [NN1,NN2] ). 

The  problem  of  approximating  a  real-valued 
function  by  some  parametric  combination  and 
composition  of  simple  functions  is  of  course 
the  raison  d'etre  of  classical  approximation 
theory.  The  traditional  measure  of  how  easy 
or  hard  a  continuous  function  is  to  approxi¬ 
mate  is  given  by  the  magnitude  of  the  func¬ 
tion's  derivative.  Generally  speaking,  func¬ 
tions  with  small  derivatives  are  easier  to  ap¬ 
proximate  because  they  change  at  a  slower 
rate.  However,  even  functions  with  small 
derivatives  are  very  hard  to  approximate  if  the 
dimension  of  the  underlying  space  is  moder¬ 
ately  large.  A  precise  statement  of  this  fact 
can  be  stated  as  follows: 

Suppose that $|f(x)| < 1$ and
$\left|\partial f / \partial x_i\right| < 1$
for $x \in I_n$ ($I_n$ being the unit n-cube in $R^n$).
Then if we seek an approximation g(x) so
that $|f - g| < \epsilon$ on $I_n$, we must sample f at more
than $c\,\epsilon^{-n}$ points for some constant c.
Conversely, if chosen properly, $O(\epsilon^{-n})$ points are
sufficient.

A simple application of the mean value
theorem shows that sampling f at that many
points (properly distributed) is sufficient
while constructing a simple class of functions
f that oscillate unpredictably but within the
constraints shows that that sampling is
necessary. (The details are simple and the
reader can easily fill them in.) For example, if
we want to approximate such a function so
that the approximation has two significant
digits, then $\epsilon = 0.01$ and for n=6 we need
about $10^{12}$ samples of the function. This
observation is completely independent of the
technique that we use for approximating, be it
polynomials, Fourier series or neural
networks. Moreover, this could also be
interpreted in terms of the classical sampling
theory of multidimensional signal processing -
signal bandwidth and the sampling rate are
closely related in a like manner.

This example illustrates that smoothness of a
function is not sufficient for making the
problem of approximating the function
feasible - the problem lies with the volume of the
sample space as a function of linear
dimension which grows exponentially in the
number of variables. Accordingly,
multidimensional approximation theory has
largely restricted itself to problems involving
very small dimensioned coordinate spaces.
Similarly, empirical data analysis has had to
be restricted to small dimensions. Two
notable exceptions are the techniques of
projection pursuit [Hu85] and CART
(classification and regression trees) [Br84].
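
The growth of the sampling requirement with dimension is easy to tabulate. The sketch below counts the points of a regular grid with spacing equal to the tolerance along each axis of the unit n-cube; taking the constant to be one and using a regular grid are simplifying assumptions made here, and the function name is hypothetical.

```python
import math
from fractions import Fraction


def grid_sample_count(eps, n):
    """Number of grid points with spacing eps along each of n axes.

    The tolerance is converted through its decimal string so that values
    such as 0.01 are treated exactly rather than as binary fractions.
    """
    per_axis = math.ceil(1 / Fraction(str(eps)))
    return per_axis ** n


# Two significant digits (eps = 0.01) in dimensions 1 through 6.
counts = [grid_sample_count(0.01, n) for n in range(1, 7)]
```

Each additional input variable multiplies the count by 100 at this tolerance, so six variables already require on the order of a million million samples.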


One  of  the  guiding  principles  of  both  neural 
network  theory  and  projection  pursuit  meth¬ 
ods  is  that  some  multidimensional  functions 
have  parsimonious  representations  in  terms 
of  linear  combinations  and  functions  of  a 
single  variable.  Linear  combinations  and  uni¬ 
variate  functions  are  considered  relatively 
easy  to  estimate  and  compute.  That  attitude  is 
encouraged by the well-known result of
Kolmogorov [Ko57,Lo76] that goes as
follows:

Theorem [Kolmogorov] There exist
m(2m+1) continuous increasing univariate
functions $h_{pq}$ with the property that given any
continuous function f on $I_m$ (the unit m-cube)
there is a continuous univariate function g so that

$$f(x_1, \ldots, x_m) = \sum_{q=1}^{2m+1} g\left(\sum_{p=1}^{m} h_{pq}(x_p)\right)$$


This  representation  involves  summations, 
fixed  univariate  functions  and  only  one  uni¬ 
variate  function  that  is  not  predetermined, 
namely  g(x).  While  superficially  this  sounds 
encouraging,  it  packs  all  of  the  complexity  of 
the multidimensional function f into the
univariate function g. See [Di84] for some
discussion  of  the  properties  of  functions 
representable  in  such  terms  involving 
polynomials  only. 

We  have  tried  to  investigate  the  Kolmogorov 
function,  g,  defined  above  for  a  complex 
problem  in  spectral  estimation.  Our  numeri¬ 
cal  experiments  sought  to  get  least  squares 
estimates  of  g  with  increasing  accuracy.  The 
results clearly show that the complexity of g
is enormous - it is a highly oscillatory function
that is poorly approximated by Fourier series
or other orthogonal basis functions. This
leads us to conjecture the existence of a
relationship between the complexity of a general
multidimensional function, f, and the
complexity of its univariate version, g, via the
Kolmogorov representation. The complexity
of a function can be measured for instance in
terms of its bandwidth (i.e., spectrum). We
believe that there are severe limitations on the
complexity  of  multidimensional  functions  that 
can  be  implemented  as  simple  combinations 
and  compositions  of  univariate  functions 
such  as  sigmoidals.  We  believe  that  some 
research  ought  to  be  devoted  to  such  ques¬ 
tions. 


At  the  same  time  as  we  outline  this  dismal 
situation,  there  have  been  a  number  of  ex¬ 
amples  where  MFC  networks  have  done  an 
admirable  job  of  modeling  and  approximating 
complex  time  series  via  nonlinear  prediction 
[La87,Mo88].  Those  examples  require  a 
closer  look  to  see  exactly  what  kind  of  mech¬ 
anism  is  used  for  generating  the  time  series. 
The  example  in  [La87]  shows  that  a  simple 
network  can  learn  and  then  replicate  quite 
well  the  behavior  of  the  quadratic  map  of  the 
unit interval into itself given by f(x,b) =
bx(1-x). The time series is generated by
iterating f. This family of maps, as b varies,
exhibits period doubling and chaos. Hence, for
different  values  of  b,  a  plot  of  the  time  series 
can  look  impressively  complex.  However, 
the  underlying  function  itself  that  generates 
this  complex  behavior  is  by  any  measure  very 
simple  to  approximate.  It  is  a  two  dimen¬ 
sional  quadratic  function.  The  general  theory 
outlined by Feigenbaum [Fe78] shows that
the behavior exhibited by bx(1-x) is generic
and  will  be  exhibited  by  any  function  that  is 
unimodal  with  a  quadratic  maximum.  Hence, 
any  reasonable  approximation  would  likely 
have  similar  behavior. 
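
The contrast between a simple generating function and a complicated-looking series can be seen by iterating the quadratic map directly. The sketch below is only an illustration; the starting point and the two values of b are arbitrary choices made here, and the function names are hypothetical.

```python
def quadratic_map(x, b):
    """The map f(x, b) = b x (1 - x) of the unit interval into itself."""
    return b * x * (1.0 - x)


def iterate_series(x0, b, length):
    """Generate a time series by repeatedly applying the quadratic map."""
    series = [x0]
    for _ in range(length - 1):
        series.append(quadratic_map(series[-1], b))
    return series


settled = iterate_series(0.2, 2.0, 50)    # converges to the fixed point 1 - 1/b
irregular = iterate_series(0.2, 3.9, 50)  # a value of b in the irregular regime
```

A network that approximates the two-variable function f(x,b) well needs to capture only a quadratic surface; the apparent complexity of the series comes from the iteration, not from the function being learned.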

The  time  series  modeled  in  [Mo88]  is  gener¬ 
ated  by  the  Mackey-Glass  equation  which  is  a 
more  complex  example  of  chaotic  behavior. 
Nonetheless,  the  model  used  four  prior  sam¬ 
ples  of  the  series  to  predict,  nonlinearly,  a 
future  sample.  The  modeling  in  [Mo88]  ba¬ 
sically  involves  estimating  a  real-valued 
function  of  four  real  variables  and,  by  our 
previous  observations,  this  comes  close  to 
what  must  be  regarded  as  a  feasible  problem 
to  solve  in  general.  To  understand  this 
particular  example  better,  we  need  to  examine 
solutions  to  the  Mackey-Glass  equation  and 
see  if  they  possess  any  special  properties,  in 
terms  of  either  predictability  or  smoothness. 

5. Determining Feasibility

The  discussion  of  the  previous  paragraph 
surrounded  the  question  of  identifying  gen¬ 
eral  analytic  criteria  for  determining  the  feasi¬ 
bility  of  using  MFC  networks  to  implement 
approximate  solutions  to  problems  in 
continuous-valued applications. These
applications include nonlinear time series
prediction and the implementation of difficult to
compute  functions. 

The  practical  problem  remains  of  deciding 
whether  a  given  empirical  data  set  is  of  the 
type  that  could  be  feasibly  implemented  by  an 
MFC network. It may not be possible to
determine whether the underlying application
satisfies  the  requisite  criteria,  whatever  they 
may  be.  This  would  be  the  case  in,  for 
example,  continuous  recognition  problems 
such as signature classification from sonar,
IR or radar imaging data. The underlying
analytical model that determines the
classification may be too complex or difficult to
express explicitly to decide whether the
application  can  be  well  served  by  MFC  net¬ 
works. 

In fact, classification using many continuous input variables is a
general application area that has been successfully handled by MFC
networks in some cases ([Se88], for example). We introduce the informal
notion of granularity as a parameter of an approximation problem in the
following way: granularity refers to the number of distinct function
values that are of interest in an MFC application. A finely grained
problem is one that involves many function values over a large part of
the input variable space, while a coarse application is a problem with
few function values of interest in which the regions where the function
assumes all except one of those values are sparsely distributed in the
input space.

Thus a classification problem involving the recognition of, say, 10
signatures from 20 real-valued signal statistics would be characterized
as a coarse problem, since the classification would typically be
nontrivial in only 10 isolated regions of the 20-dimensional input
space. Relatively speaking, the volume of the input space that involves
an interesting function value is small even though there are many
real-valued input variables. Moreover, the precision sought in such a
classification problem is relatively low compared with an application
such as time series prediction. In a qualitative way, let a coarse
problem be one that involves some combination of these features. What
is an appropriate quantitative measure of granularity as discussed
above?

In questions such as this, we believe that guidance must be sought from
very similar kinds of problems studied by statisticians in the general
methodology of CART (classification and regression trees) [Br84]. CART
is a statistically based, data driven method for partitioning an
empirical data set, typically using a succession of linear discriminant
functions. Loosely speaking, the hierarchy of linear discriminations
determines a binary decision tree, which is in fact very similar to a
multilayered neural network with hardlimiting nonlinearities. The
technique of projection pursuit [Hu85] involves computing good (with
respect to some criterion) projections of multidimensional data onto a
vector direction and performing general nonlinear regression on the
projected data. That basic step of projection and regression is
iterated on the residual data. The resulting functional form of the
approximation resembles the Kolmogorov representation very closely and
this is discussed in more detail in [Di84].
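
The remark that a hierarchy of linear discriminations resembles a multilayered network with hardlimiting nonlinearities can be illustrated with a small sketch. The three splits below are hypothetical values chosen only for illustration (they are not fitted to any data), and the function names are assumptions of this example. The sketch evaluates a depth-two decision tree directly and again as two layers of thresholded linear units, where each second-layer unit implements a logical AND of the path conditions leading to one leaf.

```python
def step(u):
    """Hard-limiting nonlinearity: 1 if u >= 0, else 0."""
    return 1 if u >= 0 else 0


def linear(w, b, x):
    """Linear discriminant w.x + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


# Hypothetical splits (weights, bias), for illustration only.
ROOT = ([1.0, 0.0], -0.5)    # tests x[0] >= 0.5
LEFT = ([0.0, 1.0], -0.2)    # applied when the root test fails
RIGHT = ([1.0, 1.0], -1.5)   # applied when the root test passes


def tree_classify(x):
    """Binary decision tree: nested linear tests, leaves labelled 0..3."""
    if linear(*ROOT, x) >= 0:
        return 3 if linear(*RIGHT, x) >= 0 else 2
    return 1 if linear(*LEFT, x) >= 0 else 0


def network_classify(x):
    """The same map written as two layers of hard-limiting units."""
    # Layer 1: one thresholded linear unit per split.
    r = step(linear(*ROOT, x))
    l = step(linear(*LEFT, x))
    t = step(linear(*RIGHT, x))
    # Layer 2: one unit per leaf; step(a + b - 2) is the AND of bits a, b.
    leaves = [
        step((1 - r) + (1 - l) - 2),  # leaf 0: root fails, left fails
        step((1 - r) + l - 2),        # leaf 1: root fails, left passes
        step(r + (1 - t) - 2),        # leaf 2: root passes, right fails
        step(r + t - 2),              # leaf 3: root passes, right passes
    ]
    return leaves.index(1)            # exactly one leaf unit fires
```

Exactly one second-layer unit is active for any input, so the tree and the network define the same partition of the input space.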

We  pose  the  following  two  questions  as  a 
challenge  for  the  statistical  audience,  given 
the  various  observations  that  we  have  made 
above. 

Can statistical techniques such as CART and projection pursuit be used
as preprocessing steps for determining the feasibility of applying MFC
networks to a specific empirical data set?

How does the performance of MFC networks compare with CART and
projection pursuit on sparse continuous classification problems?

In  summary,  we  believe  there  are  valuable 
contributions  to  be  made  by  using  known 
statistical  techniques  to  assess  the  feasibility 
of  using  MFC  neural  networks  in  a  variety  of 
problems.

Abstract

Stochastic models of some aspects of the electrical activity in the nervous system at the cellular and
network  levels  are  investigated.  In  particular,  models  of  the  subthreshold  activity  of  the  somal 
transmembrane  potential  of  neurons  are  considered  along  with  methods  of  identification  of 
physiological  parameters  of  the  discussed  models.  A  simulation  study  is  conducted  to  evaluate  the 
performance  and  efficiency  of  the  estimates  of  the  parameters. 


1.  Introduction.  Studies  of  mechanisms  underlying  neural  coding  and  the  representation  of 
information  in  the  nervous  system  are  of  great  interest  to  neuroscientists  and  modelers  of  neural 
networks.  Stochastic  models  are  essential  tools  in  describing  the  behavior  of  neurons  under  conditions 
where  large  numbers  of  inputs  and  internal  events  occur  at  the  cellular  and  network  levels.  For 
instance, there is an extensive literature concerning experimental and theoretical studies of neuronal
integration  of  synaptic  inputs  as  reflected  by  the  difference  in  potential  across  the  somal  membrane  of 
nerve  cells  (see  e.g.  Johannesma,  1968;  Tuckwell,  1979;  Ricciardi  and  Sacerdote,  1979;  Baranyi  and 
Feher,  1981;  Kallianpur,  1983;  Habib,  1985;  Ferster,  1987;  Habib  and  Thavaneswaran,  1988.)  The 
stochastic  models  developed  in  some  of  these  studies  relate  the  subthreshold  behavior  of  somal 
membrane  potential  near  the  spike  generation  (or  initial)  region  to  physiologically  meaningful 
parameters. These include the effective membrane time constant, amplitudes and rates of occurrence of
membrane  perturbations  due  to  the  arrival  of  excitatory  and  inhibitory  post-synaptic  potentials 
(EPSPs  and  IPSPs,  respectively),  and  measures  of  variability  of  synaptic  inputs.  Estimation  of  these 
parameters  using  experimentally  generated  intracellular  recordings  of  the  neuronal  membrane  potential 
should shed light on some aspects of neuronal integration of synaptic input.

In  Section  2,  we  present  several  Ito-type  stochastic  differential  equation  models  that  describe 
the activity of different types of neurons or the activity of a certain type of neuron under different
experimental  conditions.  In  Section  3,  we  discuss  statistical  methods  of  parameter  estimation  such  as 
maximum  likelihood  and  the  theory  of  optimal  estimating  functions.  In  Section  4,  we  report  on  a 
simulation  study  to  evaluate  the  performance  of  the  parameter  estimators. 

This research was supported by a research contract with the Office of Naval Research, Contract
Number N00014-83-K-0387.

2. Stochastic Neuronal Models. Assume that the state of the neuron is characterized by the
difference in potential across its membrane near a spatially restricted area of the soma called the
trigger zone (or spike initiation region). The membrane potential is modeled by a stochastic process,
$V(t)$, defined on a probability space $(\Omega, \mathcal{F}, P)$. It is subject to instantaneous changes due to the
occurrence of a) EPSPs, which are assumed to occur according to mutually independent Poisson
processes $P(\lambda_k^e; t)$ with rates $\lambda_k^e$ ($k = 1, 2, \ldots, n_1$), each accompanied by an instantaneous displacement of
$V(t)$ by a constant amount $a_k^e > 0$ ($k = 1, 2, \ldots, n_1$), and b) IPSPs, which occur according to independent
Poisson processes $P(\lambda_k^i; t)$ with effective displacement $a_k^i > 0$ ($k = 1, 2, \ldots, n_2$). Between PSPs, $V(t)$
decays exponentially to a resting potential with time constant $\tau$. As a first approximation the PSPs
are assumed to sum linearly at the trigger zone, and when $V(t)$ reaches the neuron's threshold, an
action potential takes place. Following the action potential, $V(t)$ is reset to a resting potential. Based
on this simplified model neuron and considering $n_1$ excitatory synapses and $n_2$ inhibitory ones, the
membrane potential $V(t)$ is modeled as a solution of the stochastic differential equation

(2.1)  $dV(t) = -\rho V(t)\,dt + \sum_{k=1}^{n_1} a_k^e\, dP(\lambda_k^e; t) - \sum_{k=1}^{n_2} a_k^i\, dP(\lambda_k^i; t),$

where $V(0) = V_0$ and $\rho = \tau^{-1}$. Under certain conditions the solution of (2.1) is a homogeneous
Markov  process  with  discontinuous  sample  paths.  This  model  is  known  as  Stein’s  model  (Stein,  1965) 
and  is  a  special  case  of  the  well  known  Poisson  driven  Markov  process  models.  This  model  has  been 
treated  in  the  literature  by  many  authors,  among  them  Johannesma  (1968)  and  Tuckwell  (1979). 
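
The jump-and-decay mechanism described above (Poisson arrivals of EPSPs and IPSPs, exponential decay between arrivals) can be sketched in a few lines. The sketch assumes a resting potential of zero, pools the independent Poisson inputs into one process whose events are assigned to an input with probability proportional to its rate, and uses hypothetical names (`simulate_stein`, `exc`, `inh`, `threshold`) that do not appear in the original.

```python
import math
import random


def simulate_stein(rho, exc, inh, v0=0.0, t_max=50.0, threshold=None, rng=None):
    """Simulate a Poisson-driven leaky integrator up to t_max or threshold.

    exc, inh: lists of (rate, amplitude) pairs for excitatory and inhibitory
    inputs. Returns a list of (time, V) pairs recorded just after each PSP.
    """
    rng = rng or random.Random()
    # Signed jumps: excitatory inputs raise V, inhibitory inputs lower it.
    events = [(lam, +a) for lam, a in exc] + [(lam, -a) for lam, a in inh]
    total_rate = sum(lam for lam, _ in events)
    t, v = 0.0, v0
    path = [(t, v)]
    while True:
        wait = rng.expovariate(total_rate)   # time to next PSP of any kind
        if t + wait > t_max:
            break
        t += wait
        v *= math.exp(-rho * wait)           # decay toward rest (taken as 0)
        # Pick the input that fired, with probability proportional to its rate.
        u = rng.random() * total_rate
        acc = 0.0
        for lam, jump in events:
            acc += lam
            if u <= acc:
                v += jump
                break
        path.append((t, v))
        if threshold is not None and v >= threshold:
            break                            # an action potential would occur
    return path
```

For example, `simulate_stein(1/3, [(2.0, 0.5)], [(1.0, 0.3)], threshold=3.0)` returns one subthreshold trajectory that ends either at `t_max` or at the first threshold crossing; the rates and amplitudes here are illustrative only.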

Diffusion models in which the discontinuities of $V(t)$ are smoothed out have been sought as
approximations to the discontinuous model (2.1) (see e.g. Ricciardi, 1982; Kallianpur, 1983; Lansky
and Lanska, 1987). These types of approximations are justified on the grounds that for many types of
neurons in the central nervous system, synapses are densely packed along the dendritic tree. If the
jumps of $V(t)$ are small and the rates of occurrence of the post-synaptic potentials are very large, then
the approximation of the Poisson driven Markov model by a diffusion model is appropriate and is
accomplished by allowing the amplitudes $a_k^e$, $a_k^i$ to tend to zero and the frequencies $\lambda_k^e$, $\lambda_k^i$ to
become large in a certain manner. Under some regularity conditions it was shown that model (2.1) can
be approximated by the diffusion model

(2.2)  $dV(t) = (-\rho V(t) + \mu)\,dt + \sigma\,dW(t), \quad 0 \le t \le T,$

$V(0) = V_0$, where $W$ is the standard Wiener process (or Brownian motion).

As has been mentioned, model (2.2) describes the subthreshold activity of the somal membrane
potential of neurons which receive extensive (or rapid) synaptic input with relatively small potential
displacements. This model may be suited for neurons which are spontaneously active. However, in
many situations, especially for stimulus driven neurons, this last assumption on synaptic input might be
too stringent because the nerve cell might receive a limited number of effective synaptic inputs that
induce relatively large potential displacements, in addition to the extensive synaptic diffusion inputs
discussed above. For example, in a study of the organization of inputs from the lateral geniculate
nucleus to cells in the striate cortex of the cat, Tanaka (1983) found that about 10 geniculate neurons
are functionally connected to one simple cell during the presentation of effective stimuli. A larger
(convergence) number (more than 30) was obtained from studies of geniculate projection to complex
cells. In this case a mixed model of diffusion and point process inputs may be more suitable for
describing the activity of such cortical neurons. To that end, assume that in addition to the extensive
synaptic input leading to the diffusion model (2.2), there are $n_1$ EPSPs arriving according to
independent Poisson processes $N(\lambda_k^e, t)$ with random intensities $\lambda_k^e$ and EPSP displacement
amplitudes $a_k^e$, $k = 1, 2, \ldots, n_1$. In addition, IPSPs are arriving according to the independent processes
$N(\lambda_k^i, t)$, with the corresponding parameters $\lambda_k^i$ and $a_k^i$, $k = 1, 2, \ldots, n_2$. This setup leads to the
following extended mixed model to describe the membrane potential of a stimulus driven neuron:

(2.3)  $dV(t) = (-\rho V(t) + \mu)\,dt + \sigma\,dW(t) + \sum_{k=1}^{n_1} a_k^e\, dN(\lambda_k^e, t) - \sum_{k=1}^{n_2} a_k^i\, dN(\lambda_k^i, t).$

Model (2.3) is remarkably similar to the continuous neuronal model proposed by Hopfield
(1984). The problem of parameter estimation of the mixed model has not been sufficiently addressed
in the literature. In the next section we treat the problem of parameter estimation of the diffusion
model (2.2) and the mixed model (2.3).

3. Parameter Estimation of a Diffusion Neuronal Model. Lansky (1983, 1984) considered the
problem of parameter estimation for diffusion neuronal models observed over a fixed interval $[0, T]$ and
discussed the asymptotic properties of the estimators as $T \to \infty$. Given $n$ independent trajectories
$\{V_k(t),\ \tau_{k-1} \le t \le \tau_k\}$, $k = 1, 2, \ldots, n$, where the $\tau_k$ are independent random variables (stopping
times) with $P(\tau_k < \infty) = 1$, $k = 1, 2, \ldots, n$, Habib (1985) derived maximum likelihood estimators of
the parameters $\rho$ and $\mu$ and established their large sample properties such as strong consistency and
asymptotic normality assuming $\sigma$ is known. Now recall the diffusion neuronal model (2.2). From
Sorensen (1983), the log-likelihood function is given by

(3.1)  $L_n(\rho, \mu) = \sum_{k=1}^{n} \Big\{ \int_{\tau_{k-1}}^{\tau_k} (-\rho V_k(t) + \mu)\, dV_k(t) - \frac{1}{2} \int_{\tau_{k-1}}^{\tau_k} (-\rho V_k(t) + \mu)^2\, dt \Big\}.$

The maximum likelihood estimators (MLE) $\hat\rho_n$ and $\hat\mu_n$ of $\rho$ and $\mu$, respectively, are simply those values
given by

(3.2)  $\hat\rho_n = \dfrac{D_n \big[\sum_{k=1}^{n} \int_{\tau_{k-1}}^{\tau_k} V_k(t)\, dV_k(t)\big] - \big[\sum_{k=1}^{n} \int_{\tau_{k-1}}^{\tau_k} V_k(t)\, dt\big] \big[\sum_{k=1}^{n} \int_{\tau_{k-1}}^{\tau_k} dV_k(t)\big]}{\big[\sum_{k=1}^{n} \int_{\tau_{k-1}}^{\tau_k} V_k(t)\, dt\big]^2 - D_n E_n},$

(3.3)  $\hat\mu_n = \dfrac{\big[\sum_{k=1}^{n} \int_{\tau_{k-1}}^{\tau_k} V_k(t)\, dt\big] \big[\sum_{k=1}^{n} \int_{\tau_{k-1}}^{\tau_k} V_k(t)\, dV_k(t)\big] - E_n \big[\sum_{k=1}^{n} \int_{\tau_{k-1}}^{\tau_k} dV_k(t)\big]}{\big[\sum_{k=1}^{n} \int_{\tau_{k-1}}^{\tau_k} V_k(t)\, dt\big]^2 - D_n E_n},$

where

$D_n = \sum_{k=1}^{n} (\tau_k - \tau_{k-1}), \qquad E_n = \sum_{k=1}^{n} \int_{\tau_{k-1}}^{\tau_k} V_k^2(t)\, dt.$

Using the fact that the membrane potential $V_k(t)$ is observed continuously over random intervals, the
diffusion coefficient $\sigma^2$ may be estimated from an observed trajectory $V_k$ ($k = 1, 2, \ldots, n$) by the formula

(3.4)  $\hat\sigma^2(k) = \dfrac{1}{\tau_k - \tau_{k-1}} \lim_{m \to \infty} \sum_{j=1}^{2^m} \big[ V_k(\tau_{k-1} + j\,2^{-m}(\tau_k - \tau_{k-1})) - V_k(\tau_{k-1} + (j-1)\,2^{-m}(\tau_k - \tau_{k-1})) \big]^2.$

This result may be proved using the corresponding result of Levy for Brownian motion by transforming
$V_k$ via time substitutions into Brownian motion (or Wiener process). A natural estimate of $\sigma^2$ which
employs all the observed trajectories is given by

$\hat\sigma_n^2 = \dfrac{1}{n} \sum_{k=1}^{n} \hat\sigma^2(k).$

The consistency and asymptotic normality of $\hat\rho_n$ and $\hat\mu_n$ (as $n \to \infty$) have been established in Habib
(1985).
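
In practice the integrals in the estimators (3.2) and (3.3) must be replaced by finite sums over sampled values. The sketch below is one way to do this, assuming each trajectory is sampled every `h` time units and approximating the stochastic integrals by left-endpoint (Ito) sums; the function names and the sampling convention are assumptions of this example, not part of the original. The second function approximates the quadratic-variation estimate of the diffusion coefficient by the sum of squared increments divided by the trajectory duration, averaged over trajectories.

```python
def mle_rho_mu(trajectories, h):
    """Discretized versions of the estimators of rho and mu.

    trajectories: list of lists, each holding V sampled every h time units.
    Returns (rho_hat, mu_hat).
    """
    A = B = C = D = E = 0.0
    for v in trajectories:
        for v0, v1 in zip(v, v[1:]):
            dv = v1 - v0
            A += v0 * dv       # sum approximating  integral V dV  (Ito sum)
            B += v0 * h        # sum approximating  integral V dt
            C += dv            # integral dV (telescoping sum)
            D += h             # total observation time, D_n
            E += v0 * v0 * h   # sum approximating  integral V^2 dt, E_n
    denom = B * B - D * E
    rho_hat = (D * A - B * C) / denom
    mu_hat = (B * A - E * C) / denom
    return rho_hat, mu_hat


def sigma2_hat(trajectories, h):
    """Average over trajectories of (sum of squared increments) / duration."""
    estimates = []
    for v in trajectories:
        qv = sum((b - a) ** 2 for a, b in zip(v, v[1:]))
        estimates.append(qv / (h * (len(v) - 1)))
    return sum(estimates) / len(estimates)
```

A quick sanity check of the algebra: if a trajectory is generated without noise by the recursion `v_next = v + (-rho * v + mu) * h`, the discretized estimators recover `rho` and `mu` up to rounding error, because the increments then satisfy the drift relation exactly.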


4.  Simulation  Studies.  In  this  section  we  briefly  discuss  the  results  of  a  simulation  study  to 
evaluate the performance and efficiency of estimates of the parameters $\rho$ and $\mu$ of model (2.2). This
study  provides  general  guidelines  for  the  choice  of  the  number  of  observed  trajectories  and  the  length  of 
the  observation  period  of  every  trajectory. 

For simplicity, we consider the diffusion model (2.2). Assume for the moment that the period
of observation is fixed, say $[0,T]$. In this case, the estimators $\hat\rho_{n,T}$ and $\hat\mu_{n,T}$ are defined in terms of
stochastic and ordinary integrals (cf. (3.2) and (3.3)). But in practice one has to approximate these
integrals with appropriate finite sums which depend on the digitization scheme or the partition mesh
of $[0,T]$.

In order to evaluate the performance of the estimates $\hat\rho_{n,T}$ and $\hat\mu_{n,T}$, we simulated the
solution of model (2.2) using the difference equation

(4.1)  $V(t_{k+1}) = V(t_k) + (-\rho V(t_k) + \mu)\,h + \sigma\,[W(t_{k+1}) - W(t_k)],$

where $h = T/K$, $t_k = kh$, $k = 1, 2, \ldots, K$. It is well known that the solution of (4.1) converges to $V(t)$. For
instance, if we set $V_K(t) = V(t_k)$ for $t \in [t_k, t_{k+1})$ then

$E\big( \sup_{0 \le t \le T} |V(t) - V_K(t)|^2 \big) \to 0$

as $K \to \infty$ (see Gihman and Skorokhod, 1979). This and other kinds of discretization, especially
Runge-Kutta schemes, have been extensively studied (see e.g. Magshoodi and Harris, 1987).
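
A minimal sketch of such a simulation follows. It assumes the difference equation is the standard Euler scheme for model (2.2), in which each step adds the drift times the step size and a Gaussian increment with variance equal to the step size; that choice, and the function name `euler_ou`, are assumptions of this example rather than details confirmed by the text.

```python
import math
import random


def euler_ou(rho, mu, sigma, v0, T, K, rng=None):
    """Euler approximation of dV = (-rho*V + mu) dt + sigma dW on [0, T].

    Returns the K + 1 sampled values V(t_0), ..., V(t_K) with t_k = k * T / K.
    """
    rng = rng or random.Random()
    h = T / K
    v = [v0]
    for _ in range(K):
        dw = rng.gauss(0.0, math.sqrt(h))   # Wiener increment over one step
        v.append(v[-1] + (-rho * v[-1] + mu) * h + sigma * dw)
    return v
```

With the true values listed in the tables (0.33333, 5.0 and 0.31623, read here as $\rho$, $\mu$ and $\sigma$), a call such as `euler_ou(1/3, 5.0, 0.31623, 0.0, 10.0, 100)` produces one trajectory sampled every 0.1 time units, which can then be passed to a discretized estimator.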

It is clear from Table 4.1 that for processes which are observed over a period $[0,T]$ with $T = 10$
ms, the estimates of all parameters except for $\sigma$ are very close to the true values of the parameters and
they improve as the number of observed trajectories, $n$, increases. From Table 4.2, there is no
improvement in the estimators as the number of observed trajectories $n$ increases (in fact, they
deteriorate). This apparently happens because for Table 4.2 the period of observation $[0,T]$ was longer,
$T = 15$ ms. Therefore, one may conclude that for action potentials with long durations, one does not
gain much by recording a large number of spikes, but for action potentials with relatively short
durations, one can expect that the parameter estimators will improve as the number of observed action
potentials increases.

5. Conclusions. The stochastic models considered in Section 2 take into account only the
temporal aspects of synaptic input. It is well established, though, that among the important factors
influencing synaptic integration are the geometry of the dendrites of post-synaptic neurons and the
spatial organization of synaptic input. Habib and Thavaneswaran (1988) proposed a stochastic partial
differential equation which is based on a cable model of a system of branched dendrites projected onto
a one dimensional equivalent dendrite as proposed by Rall (1978). The theory of optimal estimating
functions was applied in this case to obtain estimates of the model's parameters.


Table 4.1: Parameter estimates using a simulated diffusion process
observed n times over a fixed period [0,T] and sampled every $\delta$ units:
(a) T = 10 ms, $\delta$ = 0.10.

Parameter   True Value   Estimated Value   Estimated Value   Estimated Value
                         n = 1             n = 10            n = 50
$\rho$      0.33333      0.30336           0.33000           0.33427
$\mu$       5.00000      4.63803           4.84648           4.88702
$\sigma$    0.31623      0.67566           0.67364           0.67583

Table 4.2: Parameter estimates using a simulated diffusion process
observed n times over a fixed period [0,T] and sampled every $\delta$ units:
(b) T = 20 ms, $\delta$ = 0.10.

Parameter   True Value   Estimated Value   Estimated Value   Estimated Value
                         n = 1             n = 10            n = 50
$\rho$      0.33333      0.30369           0.32705           0.32399
$\mu$       5.00000      4.86121           4.77822           4.71001
$\sigma$    0.31623      0.33012           0.51796           0.33537

Before concluding, it should be noted that the parameters $\rho$ and $\mu$ in the mixed Ito-Markov
model (2.3) may be estimated using the theory of optimal estimating functions. Indeed, let

(5.1)  $N(t) = \sum_{k=1}^{n_1} N(\lambda_k^e, t) - \sum_{k=1}^{n_2} N(\lambda_k^i, t),$

and

(5.2)  $E[N(t)] = \Big( \sum_{k=1}^{n_1} \lambda_k^e - \sum_{k=1}^{n_2} \lambda_k^i \Big)\, t = \lambda t.$

Notice that $M(t) = W(t) + N(t) - \lambda t$ is a martingale with $M(0) = 0$. Substituting in (2.3), we obtain
the equivalent model

(5.3)  $dV(t) = (-\rho V(t) + \mu')\,dt + dM(t),$

where $\mu' = \mu + \lambda$. The method of optimal estimating functions can be used in this case and it can be
shown that the optimal estimates of $\rho$ and $\mu'$ are identical to the maximum likelihood estimates $\hat\rho$ and
$\hat\mu$ in (3.2) and (3.3).

One may then estimate the parameters $\rho$ and $\mu$ of model (2.2) from data recorded while the
neuron is spontaneously active. Similarly, the parameters $\rho$ and $\mu'$ of model (5.3) may be
estimated from data recorded from the same neuron during periods of stimulus-driven activity. In this
case, it is possible to estimate the parameter $\lambda = \mu' - \mu$, which reflects the impact of the synaptic
activity due to the presence of the stimulus. Also, a change in the value of the parameter $\rho$ may reflect
changes in the membrane properties due to the stimulus.

Abstract

A variety of network models for empirical inference have
been  introduced  in  rudimentary  form  as  models  for  neurological 
computation.  Motivated  in  part  by  these  brain  models  and  to  a 
greater  extent  motivated  by  the  need  for  general  purpose 
capabilities  for  empirical  estimation  and  classification, 
learning  network  models  have  been  developed  and  successfully 
applied  to  complex  engineering  problems  for  at  least  25  years. 
In  the  statistics  community,  there  is  considerable  interest  in 
similar  models  for  the  inference  of  high-dimensional 
relationships.  In  these  methods,  functions  of  many  variables 
are estimated by composing functions of more tractable
lower-dimensional forms. In this presentation, we describe the
commonality  as  well  as  the  diversity  of  the  network  models 
introduced  in  these  different  settings  and  point  toward  some 
new  developments. 

1.  Introduction 

In  the  context  of  empirical  inference  of  functions  of  many 
variables,  a  network  is  a  function  represented  by  the 
composition  of  many  basic  functions.  The  basic  functions 
(which  are  also  called  elements,  units,  building  blocks, 
network  nodes,  or  sometimes  artificial  neurons)  are  constrained 
in  form:  typically  nonlinear  functions  of  a  few  variables  or 
linear  functions  of  many  variables.  By  definition,  a  learning 
network  estimates  its  function  from  representative 
observations  of  the  relevant  variables. 

Several  composition  schemes  for  network  functions  and 
corresponding  estimation  algorithms  are  reviewed  in  this 
paper.  Consideration  is  given  to  certain  networks  popular  in 
the neurocomputing field such as perceptrons, madalines, and
backpropagation networks. (For a collection of some of the key
papers in this field see the volume edited by Anderson and
Rosenfeld  1988.)  Unfortunately  many  learning  networks  are 
inflexible  in  the  form  of  the  basic  functions,  inflexible  in  the 
connectivity  of  the  network,  and  lack  global  optimization  of 
the  network  function.  More  consideration  is  given  here  to 
globally  optimized  networks,  networks  with  adaptively 
synthesized  structure,  and  networks  with  nonparametrically 
estimated  units.  Particular  attention  is  given  to  polynomial 
networks  (R.L.  Barron  et  al.  1964,  1975,  1984,  Ivakhnenko 
1971),  projection  pursuit  (Friedman  et  al.  1974,  1981,  Huber 
1985)  and  transformations  of  additive  models  (Stone  1985, 
Tibshirani  1988).  New  composition  schemes  are  suggested 
which  combine  the  positive  benefits  of  the  above  methods. 

Although  there  are  interesting  analogies  of  statistically 
estimated  network  functions  with  the  activity  of  networks  of 
living  neurons,  we  shall  not  constrain  our  network  functions  to 
be  biologically  viable  models.  Instead  the  focus  is  on  the 
development  of  empirical  modeling  capabilities  for  network 
function  so  as  to  represent  the  input/output  behavior  of  a  wide 
range  of  complex  systems  for  scientific  and  engineering 
applications. 

Mathematical  limitations  of  high-dimensional 
estimation  are  discussed.  Bounds  from  nonparametric 
statistical  theory  show  that  reasonably  accurate  estimation 
uniform  for  all  smooth  functions  (e.g.  functions  with  bounded 
first  partial  derivatives)  is  not  possible  in  high  dimensions 
with  practical  sample  sizes.  Network  strategies  avoid  some 
of  the  pitfalls  of  high-dimensionality  by  searching  for 
structures  parameterized  by  lower  dimensional  forms.  The 
advantage is that for high-dimensional problems the
variance (estimation error) associated with such networks can
be  much  smaller  than  associated  with  more  traditional 
approaches.  As  for  the  bias  (approximation  error),  the 
evidence  is  that  for  many  practically  occurring  functions 
accurate  network  approximations  exist,  in  spite  of  the 
theoretical  fact  that  high-dimensional  functions  can  possess 
sufficiently  irregular  structure  so  as  to  preclude  accurate 
estimation. 

Some  dynamic  network  models  (such  as  the  Hopfield 
network  1981)  are  differential  equations  (or  difference 
equations)  resulting  from  cycles  present  in  the  interconnected 
network.  In  this  paper  we  restrict  attention  to  static  network 
models  which  have  no  loops  in  the  network.  Thus  the  network 
is  a  tree  of  interconnected  functions  which  implements  a  single 
input/output function, which may be adjusted by the empirical
estimation  process,  but  otherwise  is  static. 

2.  Block  Diagrams 

We  present  a  hypothetical  network  to  get  oriented  to  some 
terminology  and  notation.  A  function  which  is  defined  as  a 
composition,  such  as 

$f(x_1, x_2, x_3, x_4) = g_0(g_1(g_3(x_1, x_2),\ g_4(x_1, x_3, x_4)),\ g_2(g_4(x_1, x_3, x_4),\ g_5(x_4)))$

may also be written in terms of intermediate variables

$f = g_0(z_1, z_2),$
$z_1 = g_1(z_3, z_4), \quad z_2 = g_2(z_4, z_5),$
$z_3 = g_3(x_1, x_2), \quad z_4 = g_4(x_1, x_3, x_4), \quad z_5 = g_5(x_4),$

or it may be drawn as a network diagram (Fig. 1).

Fig. 1. Example Network

The  layers  of  a  network  are  the  sets  of  functions  which  occupy 
the  same  depth  in  the  tree. 
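
The block-diagram idea, a function built by composing basic functions through intermediate variables, can be written out as a short sketch. The wiring below (which inputs feed `g3`, `g4` and `g5`) is one plausible reading of the example network and should be treated as an assumption, as should the particular elemental functions, which are arbitrary placeholders chosen so the result can be checked by hand.

```python
def make_network(g0, g1, g2, g3, g4, g5):
    """Build f(x1, x2, x3, x4) from six basic functions arranged in layers."""
    def f(x1, x2, x3, x4):
        # Deepest layer: elements fed directly by the original inputs.
        z3 = g3(x1, x2)
        z4 = g4(x1, x3, x4)
        z5 = g5(x4)
        # Middle layer: elements fed by intermediate variables.
        z1 = g1(z3, z4)
        z2 = g2(z4, z5)        # z4 is shared by two elements
        # Root element produces the network output.
        return g0(z1, z2)
    return f


# Placeholder elements, for illustration only.
example = make_network(
    g0=lambda a, b: a * b,
    g1=lambda a, b: a - b,
    g2=lambda a, b: a + b,
    g3=lambda a, b: a + b,
    g4=lambda a, b, c: a * b + c,
    g5=lambda a: a ** 2,
)
```

Evaluating `example(1, 2, 3, 4)` proceeds layer by layer: the deepest layer gives 3, 7 and 16, the middle layer gives -4 and 23, and the root returns their product.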

For  a  general  notation  for  network  functions,  in  which  the 
indices  on  a  basic  function  specify  the  position  of  the  function 
in  the  tree  relative  to  the  root  node,  see  Lorentz  (1966).  He 
called  network  functions  superposition  schemes.  Lorentz  made 
fundamental  contributions  to  the  theory  of  representing 
functions  by  compositions  which  are  discussed  later  in  this 
paper. 

Representations for network functions are not unique. For
instance, if some of the basic functions are absorbed into the
functions  to  which  they  are  input,  then  fewer  elements  are 
obtained,  but  the  new  elements  have  possibly  greater  input 
dimension. 

Motivated by the application to modeling human vision,
Rosenblatt (1962, ch. 4) called networks with arbitrary
elemental functions perceptrons (although subsequently the
term has been used to refer to just one type of network with
thresholded  linear  elements  that  Rosenblatt  extensively 
studied).  Our  definition  differs  slightly  from  Rosenblatt's  in 
that  he  allowed  transformations  to  occur  on  the  branches 
(interconnections)  of  the  network.  Such  networks  are 
represented  in  our  form  either  by  defining  additional  single 
input  nodes  or  by  absorbing  each  such  transformation  into  the 
node  to  which  the  branch  is  directed. 

3.  The  Building  Blocks 

For  learning  networks  it  is  important  to  choose  elements 
of  the  network  with  sufficiently  general  form  that  the 
resulting  networks  can  approximate  nearly  any  function  of 
interest.  It  is  also  important  to  choose  these  elements  with 
sufficiently  small  dimension  or  complexity  that  they  can  be 
accurately  estimated.  Different  approaches  to  resolving  the 
tension between these two seemingly conflicting objectives
result  in  a  variety  of  different  learning  network  schemes. 

Let  the  function  g(z)  denote  an  element  of  the  network, 
where  z  is  the  vector  of  intermediate  variables  (outputs  from 
preceding  elements  or  sometimes  original  input  variables) 
which  are  input  to  the  given  node.  The  most  common  forms  of 
elements  roughly  can  be  categorized  as  parametric  or 
nonparametric. 

Parametric elements: These are basic functions $g(z, \theta)$ which
depend  on  a  vector  of  unknown  parameters.  The  parametric 
elements  which  have  been  proposed  for  learning  networks 
usually  take  one  of  the  following  forms: 


(1)  $g(z, \theta) = h\big(\sum_k \theta_k z_k + \theta_0\big)$

(2)  $g(z, \theta) = \sum_k \theta_k \varphi_k(z)$

or, more generally,

(3)  $g(z, \theta) = h\big(\sum_k \theta_k \varphi_k(z)\big)$

where $\varphi_k$, $k = 1, \ldots, m$, and $h$ are fixed functions. The two most
common choices for the $\varphi_k$ are linear terms (coordinate
functions), so that the sum simply implements a linear
combination of the inputs as in (1), or polynomial terms of
moderate degree. The nonlinear function $h$ is typically chosen
to be a nondecreasing function bounded by one (such as a unit
step  function)  -  this  is  frequently  incorporated  in  networks 
intended  for  binary  classification.  The  parameters  of  each 
element  are  estimated  from  observed  data,  typically  by  a 
least  squares  or  likelihood  based  criterion.  The  specific 
method  used  to  estimate  the  parameters  depends  on  the 
probabilistic  structure  of  the  data,  the  network  synthesis 
strategy,  and  the  intended  use  of  the  network  (see  section  4 
below). 
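
The three parametric forms can be sketched as plain functions. The names below, the choice of a logistic function as an alternative bounded nondecreasing nonlinearity, and the quadratic basis in two variables are all assumptions made for illustration; the text itself only requires fixed basis functions and a fixed nonlinearity.

```python
import math


def unit_step(u):
    """Nondecreasing nonlinearity bounded by one (hard limiter)."""
    return 1.0 if u >= 0 else 0.0


def logistic(u):
    """A smooth nondecreasing alternative, also bounded by one."""
    return 1.0 / (1.0 + math.exp(-u))


def element_linear_threshold(z, theta, theta0, h=unit_step):
    """Nonlinearity applied to a linear combination of the inputs."""
    return h(sum(t * zk for t, zk in zip(theta, z)) + theta0)


def element_basis(z, theta, basis):
    """Linear combination of fixed basis functions of the inputs."""
    return sum(t * phi(z) for t, phi in zip(theta, basis))


def element_general(z, theta, basis, h):
    """Nonlinearity applied to a linear combination of basis functions."""
    return h(element_basis(z, theta, basis))


# Polynomial terms of degree at most two in two inputs (illustrative basis).
QUAD = [
    lambda z: 1.0,
    lambda z: z[0],
    lambda z: z[1],
    lambda z: z[0] * z[1],
    lambda z: z[0] ** 2,
    lambda z: z[1] ** 2,
]
```

With coordinate functions as the basis, `element_general` reduces to the thresholded linear element apart from the constant term, which is how the linear form arises as a special case of the general one.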

Nonparametric  elements:  Some  of  the  element  functions  g(z) 
may  be  regarded  as  unknown  and  constrained  only  in  terms  of 
basic  smoothness  properties  (e.g.  bounded  derivative),  or  in 
some  cases  g  is  modeled  as  a  stochastic  process  indexed  by  z  (a 
Bayes  formulation).  Such  functions  are  estimated  by  a 
smoothing  technique  such  as  local  linear  fits,  smoothing 
splines,  variable  kernel  estimation,  truncated  trigonometric 
series, variable degree polynomials, or stochastic process
estimation.  Typically  parameters  of  the  smoothing  technique 
are  selected  by  a  criterion  such  as  cross-validation,  predicted 
squared  error,  or  penalized  likelihood.  With  nonparametric 
elements  it  is  important  that  the  dimension  of  the  z  variables 
be  kept  to  a  minimum.  (Otherwise  the  statistical  theory 
indicates  that  it  would  be  difficult  to  estimate  these  element 
functions.) 
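
As one concrete instance of a nonparametric element of a single variable, the sketch below uses a fixed-bandwidth Gaussian kernel smoother, with the bandwidth scored by leave-one-out cross-validation. This is a swapped-in, simpler relative of the variable kernel estimators mentioned above, and the function names are assumptions of this example.

```python
import math


def kernel_smoother(zs, ys, bandwidth):
    """Return g_hat(z): a Gaussian-kernel weighted average of the ys."""
    def g_hat(z):
        weights = [math.exp(-0.5 * ((z - zi) / bandwidth) ** 2) for zi in zs]
        total = sum(weights)   # may underflow to 0 if z is far from all zs
        return sum(w * y for w, y in zip(weights, ys)) / total
    return g_hat


def loo_cv_error(zs, ys, bandwidth):
    """Leave-one-out cross-validated mean squared prediction error."""
    err = 0.0
    for i in range(len(zs)):
        # Fit without observation i, then predict it.
        g = kernel_smoother(zs[:i] + zs[i + 1:], ys[:i] + ys[i + 1:], bandwidth)
        err += (ys[i] - g(zs[i])) ** 2
    return err / len(zs)
```

A bandwidth could then be chosen by evaluating `loo_cv_error` over a small grid of candidate values and keeping the minimizer. Both functions expect `zs` and `ys` as Python lists of equal length.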


Mixed  parametric/nonparametric:  In  this  case  both  types  of 
elements  appear  in  the  network.  A  particularly  interesting 
approach  is  to  combine  nonparametric  elements,  each  of  which 
depends  only  on  one  variable,  with  elements  which  implement 
linear  combinations  of  many  variables.  It  will  be  seen  that 
networks  of  this  mixed  structure  have  the  potential  to 
approximate  any  function. 

We use the notation $f(x, \theta)$ to refer to the complete
network function where $x$ is the vector of all original input
variables and $\theta$ is the vector of all parameters which appear
in the network.

4.  The  Structure  of  the  Data  and  Objective  of  Network 
Estimation 

In practice, networks are estimated from a training
sample of observations of relevant variables. The sample is
typically a sequence of input/output pairs $(X_1, Y_1), \ldots, (X_n, Y_n)$
where each $X$ is a $d$-dimensional vector. We focus on the case
in which the observations are independent, each with the
same probability distribution. (Certain problems
involving data with stationary serial dependencies can also
be treated, in which case the relevant distribution is the
conditional distribution given the past.) This probability
distribution is assumed to depend on an unknown function $f(x)$;
it is this function which neural networks seek to approximate.
The  assumed  nature  of  this  function  depends  on  the  objective  of 
the  problem  (e.g.  regression,  prediction,  classification,  density 
estimation)  and  the  criterion  by  which  performance  is 
measured. 

Perhaps the most common use of learning networks is to
seek a function $f(x)$ to minimize the mean squared error
$E(Y - f(X))^2$; that is, the function we wish to estimate is the
conditional mean $f(x) = E[Y \mid X = x]$. For problems of curve
fitting, regression, or prediction this conditional mean function
has traditionally been the principal object of interest for
learning networks. (For certain time-series prediction
problems the desired function takes on the specific form
$f(x) = E[Y_t \mid Y_{t-1} = x_1, \ldots, Y_{t-d} = x_d]$.) In particular, this
framework (associated with a squared error measure of loss) is
appropriate when a function $f(x)$ is measured subject to (mean
zero) Gaussian error at randomly distributed design points.

For classification problems, an optimal discriminant
function is one for which the overall probability of error is
minimized. Most often, learning networks have been utilized
to seek an indirect solution to the classification problem by
using the mean squared error as the criterion. For two-class
classification with $Y \in \{0,1\}$ the conditional mean function
reduces to the optimal discriminant $f(x) = P[Y = 1 \mid X = x]$.
Nevertheless, it may be more appropriate to seek to estimate
the logistic regression function
$f(x) = \log(P[Y = 1 \mid x]/(1 - P[Y = 1 \mid x]))$
using likelihood-based criteria. In principle,
probability density estimation can also be handled using
learning networks and a likelihood criterion, in which case $f$
is taken to be the logarithm of the joint density function of the
random vector.

The intended use of estimated network functions $\hat f$ may
dictate probability models and performance objectives other
than those indicated above. For instance the object may be to
search for the extreme points of a function $f$ by using the
extreme points of $\hat f$. For problems in vehicle guidance, the
function $\hat f$ might estimate parameters of an optimum (two-point
boundary-value) guidance law as a function of current
and desired final vehicle states (in situations where the
optimum $f$ can only be obtained by extensive off-line
iteration), in which case the ultimate performance objective is
to minimize the final miss distance, rather than to minimize
the mean squared error of the parameter estimates.
Nevertheless, learning network methodologies have proven
successful in some of these contexts (see R. L. Barron and Abbott
1988).

Most network algorithms have been designed for
regression or classification with minimum mean squared error
as the performance objective, and our attention will be focused
primarily on this case.

5. Criteria for Network Estimation and Selection

Here  we  discuss  model  selection  criteria  needed  for  the 
estimation  of  network  functions.  Without  the  use  of  an 
appropriately  penalized  performance  criterion,  an  overly 
complex  network  may  be  estimated  which  accurately  fits  the 
training  data  but  will  not  prove  to  be  accurate  on  new  data. 

Predicted squared error: If a network structure $f(x, \theta)$ is fixed
and if the total number of parameters $k$ is small compared to
the sample size $n$, then the minimum mean squared error
$\min_\theta E(Y - f(X, \theta))^2$ is approximately achieved by seeking
parameter estimates $\hat\theta$ that produce the minimum average
squared error on the training set,
$TSE = \frac{1}{n}\sum_{i=1}^{n} (Y_i - f(X_i, \hat\theta))^2$.

However, if $k$ is large compared to $n$, then the model may
have small error on the given data, but it is likely to have
large error on future data from the same distribution. This
phenomenon is partly explained by noting that, under certain
conditions (namely that the network depends linearly on the
parameters and the true function $f(x)$ happens to be a member
of the given $k$-dimensional family with error variance
$\sigma^2 = E(Y - f(X))^2$), the mean squared error of an estimated
network of fixed dimension $k$ is not equal to the error variance
but rather is equal to
$E(Y - f(X, \hat\theta))^2 = \sigma^2 + (k/n)\sigma^2$; see
Mallows (1973), A.R. Barron (1984). This leads, in view of the
fact that under the same conditions
$E(TSE) = \sigma^2 - (k/n)\sigma^2$, to
the predicted squared error PSE criterion as an unbiased
estimator of the future performance:

$$PSE = TSE + \frac{2k}{n}\sigma^2 \qquad (4)$$

This criterion is very similar to (and in some cases equivalent
to) the Cp statistic proposed by Mallows (1973), the
generalized cross-validation criterion of Craven and Wahba
(1979), the final prediction error of Akaike (1970), and a
specialization of the AIC proposed by Akaike (1973). For a
recent treatment of these various criteria with emphasis on
generalized cross-validation see Eubanks (1988, ch. 2).
Calculations similar to those in Akaike (1973) show that PSE
continues to be an asymptotically unbiased estimator of the
mean squared error $E(Y - f(X, \hat\theta))^2$ even if $f(x, \theta)$ is not a
linear function of $\theta$, provided this function is sufficiently
smooth.
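As a minimal sketch of how the PSE criterion (4) might be evaluated in practice, the following Python fragment fits nested polynomial models by least squares and adds the penalty $(2k/n)\sigma^2$ to the training squared error. The polynomial family, the simulated data, the known value of $\sigma$, and all function names are illustrative assumptions and are not taken from the paper.

```python
import random

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            factor = M[r][c] / M[c][c]
            for j in range(c, n + 1):
                M[r][j] -= factor * M[c][j]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(M[r][j] * x[j] for j in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x

def fit_polynomial(xs, ys, k):
    """Least-squares polynomial with k coefficients (degree k - 1)."""
    A = [[sum(x ** (i + j) for x in xs) for j in range(k)] for i in range(k)]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(k)]
    return solve(A, b)

def predict(coef, x):
    return sum(c * x ** i for i, c in enumerate(coef))

def tse(xs, ys, coef):
    """Average squared error on the training set."""
    return sum((y - predict(coef, x)) ** 2 for x, y in zip(xs, ys)) / len(xs)

def pse(xs, ys, k, sigma2):
    """Predicted squared error, equation (4): TSE + (2k/n) sigma^2."""
    coef = fit_polynomial(xs, ys, k)
    return tse(xs, ys, coef) + 2.0 * k / len(xs) * sigma2

# Hypothetical data: a quadratic observed with Gaussian error of known variance.
rng = random.Random(0)
sigma = 0.1
xs = [-1.0 + 2.0 * i / 49 for i in range(50)]
ys = [1.0 + 2.0 * x - 3.0 * x * x + rng.gauss(0.0, sigma) for x in xs]

tse_by_k = [tse(xs, ys, fit_polynomial(xs, ys, k)) for k in range(1, 7)]
pse_by_k = [pse(xs, ys, k, sigma ** 2) for k in range(1, 7)]
best_k = 1 + min(range(6), key=lambda i: pse_by_k[i])  # dimension chosen by PSE
```

Because the models are nested, the training error can only decrease as k grows, whereas the penalty grows linearly in k; the selected dimension is the one at which the sum is smallest.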

Unfortunately, if the network function is selected so as to
minimize PSE among a collection of functions of various
parameter dimensions, then there is no general guarantee that
the resulting minimum PSE will be an accurate estimate of the
mean squared error of the estimated function. Indeed, if the
true function $f$ is a member of one of the finite-dimensional
network families, then the PSE criterion has a tendency to
overestimate the dimension (see Atkinson 1980, 1981). On the
other hand, the work by Shibata (1984, 1986) shows in related
contexts that if the true function $f(x)$ is not exactly
representable by any of the finite dimensional models in a
sequence $f_k(x, \theta_k)$ for $k = 1, 2, \ldots$ (but can nevertheless be
approximated by such models), then selection of $k$ by a
criterion of the form given above is optimal in the sense that
the resulting expected squared error $E(f(X) - \hat f(X))^2$ is
asymptotically equivalent to $\min_k E(f(X) - f_k(X, \hat\theta_k))^2$ as
$n \to \infty$. It is
not known if the results of Shibata carry over to the
estimation of network functions. Nevertheless, in our
experience with numerous practical cases (see Barron et al.
1984), networks selected by minimizing PSE have
approximately minimal average squared error on independent
sets of test data (in the sense that if the growth of adaptively
synthesized networks is halted on an earlier layer or allowed
to extend to a larger number of layers, then a significant
increase in the average squared error on the test set does not
usually occur).

If the error variance $\sigma^2$ is not known, an estimate $\hat\sigma^2$
can be used in its place in the PSE criterion; however, to avoid
overfit care must be taken to avoid having $\hat\sigma^2$ much less than
$\sigma^2$; in particular, $\hat\sigma^2$ should not be varied during the process of
selecting $k$ (A. R. Barron 1984). We suggest that nearest
neighbor regression be used prior to network synthesis to
determine a rough estimate of the error variance with the
desired properties. To permit consistent estimation of $f$ in the
case that it can be exactly represented by a finite dimensional
network (as well as in the case that it can be arbitrarily well
approximated by networks of sufficient dimensionality) other
criteria should be used which place a greater penalty on the
dimensionality of the model (e.g. $\frac{k}{n}\log n$ instead of $\frac{2k}{n}$).
Criteria significantly different from PSE will not possess the
optimum rate property of Shibata in the context that he
considers; however, it is not known to what extent the
convergence rate is slowed.
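The text does not spell out how the nearest neighbor estimate of the error variance is formed. One plausible reading, offered here only as an assumption, is to predict each response by the response of its nearest neighbor and halve the average squared residual (the residual is roughly a difference of two independent errors when $f$ is smooth). The sketch below assumes scalar inputs, and the function name is hypothetical.

```python
def nearest_neighbor_variance(xs, ys):
    """Rough error-variance estimate from one-nearest-neighbor residuals.

    Assumed construction (not taken from the paper): each y_i is predicted
    by the response at the nearest other design point; the squared
    residuals are averaged and divided by 2.
    """
    n = len(xs)
    total = 0.0
    for i in range(n):
        others = (m for m in range(n) if m != i)
        j = min(others, key=lambda m: abs(xs[m] - xs[i]))
        total += (ys[i] - ys[j]) ** 2
    return total / (2.0 * n)
```

Variation of $f$ between neighboring points adds a nonnegative term to this estimate, so under these assumptions it tends to err on the high side, which is the safer direction for use in PSE.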

Likelihood based criteria: Suppose the random vectors $(X_i, Y_i)$
have a conditional probability density function $p(y \mid x, f)$
which depends in a known way on the value of $f$ (whereas the
true function $f(x)$ may be unknown). Let $f(x, \theta)$ be a given
network structure with a $k$-dimensional parameter $\theta$. Assume
that $\theta$ is estimated so as to maximize the likelihood
$p(Y^n \mid X^n, f(\cdot, \theta)) = \prod_{i=1}^{n} p(Y_i \mid X_i, f(X_i, \theta))$.
Define the Akaike
information criterion (Akaike 1973) by

$$AIC = -\log p(Y^n \mid X^n, f(\cdot, \hat\theta)) + k \qquad (5)$$

and define the minimum description length criterion (Rissanen
1978, 1983) by

$$MDL = -\log p(Y^n \mid X^n, f(\cdot, \hat\theta)) + \frac{k}{2}\log n. \qquad (6)$$

These criteria are used to choose between models of various
dimensions. Akaike derived the AIC as an asymptotic bias
correction for the estimation of expected entropy loss, in much
the same manner that PSE is an asymptotic bias correction for
the estimation of expected squared error. Rissanen derived the
MDL criterion as the length of a uniquely decodable code for
quantizations of the data $Y^n$ given the data $X^n$ (ignoring terms
which are asymptotically constant for $k$ bounded). Unlike the
optimal Shannon code, Rissanen's code does not require
knowledge of the function $f$. Instead, the MDL code uses
quantized maximum likelihood estimates of the parameters of
the function as a preamble of the code (using $\frac{1}{2}\log n$ bits per
parameter). The criterion can also be derived as an
asymptotic approximation for the Bayesian test statistics
which minimize average probability of error in the selection
of the model (see Schwarz, 1978, Clarke and A.R. Barron,
1988).

The validity of the derivations of the AIC, MDL, and Bayes
criteria requires smoothness conditions. In particular the
sample Fisher information matrix $\hat I$ of second partial
derivatives with respect to $\theta$ of $-\log p(Y^n \mid X^n, f(\cdot, \theta))$
(evaluated at $\theta = \hat\theta$) should be positive definite. A more
precise form of the MDL or Bayes criterion uses $\frac{1}{2}\log\det(\hat I)$
instead of $\frac{k}{2}\log n$.

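For regression with a Gaussian error distribution and
known error variance, the AIC reduces to the PSE criterion and
MDL reduces to a criterion equivalent to

$$TSE + \frac{k}{n}(\log n)\sigma^2. \qquad (7)$$

(In this case
$-\log p(Y^n \mid X^n, f(\cdot, \hat\theta)) = \frac{n}{2\sigma^2}TSE + \frac{n}{2}\log(2\pi\sigma^2)$,
so multiplying (5) or (6) by $2\sigma^2/n$ and dropping the constant
term gives the stated forms.)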

For classification problems with $Y \in \{0,1\}$, likelihood based
criteria are defined by using the Bernoulli model
$p(y \mid x, f) = (f(x))^{y}(1 - f(x))^{1-y}$ (in which case care must be
taken to use networks with $0 < f(x) < 1$). The equally general
logistic model $p(y \mid x, f) = e^{y f(x)}/(1 + e^{f(x)})$ may be preferred for
classification problems, since it forces satisfaction of the
probability constraints $0 < p < 1$ without constraining the
function $f$. For logistic regression the minus log-likelihood
takes the form
$\sum_{i=1}^{n} \log(1 + e^{f(X_i, \theta)}) - \sum_{i=1}^{n} Y_i f(X_i, \theta)$, which is
minimized (e.g. by Newton's method in the context of various
synthesis strategies) and then penalized by $k$ or $\frac{k}{2}\log n$ as
appropriate for the desired criterion.
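To make the logistic case concrete, the following sketch minimizes the minus log-likelihood by Newton's method for the simplest possible structure, a single linear element $f(x, \theta) = \theta_0 + \theta_1 x$, and then forms the penalized criteria (5) and (6). The linear element, the toy data, and the function names are assumptions made for illustration; a network $f$ would require the corresponding derivatives of the network function.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def minus_log_likelihood(theta, xs, ys):
    """sum_i log(1 + exp(f_i)) - sum_i y_i f_i, with f_i = theta0 + theta1 x_i."""
    total = 0.0
    for x, y in zip(xs, ys):
        f = theta[0] + theta[1] * x
        total += math.log1p(math.exp(f)) - y * f
    return total

def gradient(theta, xs, ys):
    g0 = g1 = 0.0
    for x, y in zip(xs, ys):
        r = sigmoid(theta[0] + theta[1] * x) - y
        g0 += r
        g1 += r * x
    return g0, g1

def newton_fit(xs, ys, steps=25):
    theta = [0.0, 0.0]
    for _ in range(steps):
        g0, g1 = gradient(theta, xs, ys)
        h00 = h01 = h11 = 0.0
        for x in xs:
            p = sigmoid(theta[0] + theta[1] * x)
            w = p * (1.0 - p)
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        # Newton step theta <- theta - H^{-1} g, with the 2x2 inverse written out.
        theta[0] -= (h11 * g0 - h01 * g1) / det
        theta[1] -= (h00 * g1 - h01 * g0) / det
    return theta

# Hypothetical, non-separable two-class data.
xs = [-2.0, -1.0, 0.0, 1.0, 2.0, -1.0, 0.0, 1.0]
ys = [0, 0, 0, 1, 1, 1, 1, 0]
theta_hat = newton_fit(xs, ys)
nll = minus_log_likelihood(theta_hat, xs, ys)
k, n = 2, len(xs)
aic = nll + k                      # equation (5)
mdl = nll + 0.5 * k * math.log(n)  # equation (6)
```

The data are deliberately non-separable: for perfectly separable classes the likelihood has no finite maximizer and the Newton iteration would not settle.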

Complexity  regularization:  In  A.R.  Barron  (1985)  the 
minimum  description  length  criterion  is  extended  to 
nonparametric  contexts  in  which  the  description  length  need 
not  reduce  to  the  form  of  (6).  Consistency  results  are  obtained 
in A.R. Barron (1985, 1987) which show convergence (as $n \to \infty$)
of  distributions  estimated  by  the  complexity  regularization. 
The  specialization  of  the  convergence  results  to  the  case  of 
estimation  of  network  functions  is  given  in  the  Appendix. 

6.  Main  Strategies  for  Network  Synthesis 

There  are  two  main  strategies  for  the  synthesis  of 
networks depending on whether the structure of the network is
fixed or allowed to evolve during the synthesis process.

Fixed  networks:  In  this  approach  a  fixed  composition  structure 
(often  relatively  large)  is  preselected  with  the  hope  that  the 
desired  function  can  be  accurately  approximated  by  networks 
of the selected form. The problem of choosing parameters of
the network so as to optimize a performance criterion may be
regarded  as  a  global  search  of  a  highly  multimodal  surface.  In 
general,  global  convergence  is  difficult  to  guarantee; 
nevertheless,  by  choosing  a  network  function  which  depends 
smoothly  on  the  parameters  it  is  often  feasible  to  estimate 
sufficiently  accurate  network  functions  by  certain  global 
search  techniques  (e.g.  techniques  which  alternate  global 
random  and  local  gradient  search).  Other  methods  for 
estimating  network  functions  attempt  to  localize  the  search 
within  each  unit  of  the  network  by  defining  target  values  for 
each  elemental  function.  More  specifics  are  given  in  section  7 
below. 

The  advantage  of  the  fixed  network  approach  is  that 
certain  structures  are  known  to  have  the  ability  to 
approximate  any  continuous  function  (see  section  13). 
However,  for  moderate  sample  sizes,  these  fixed  structures 
may  have  too  large  a  parameter  dimension  for  the  least 
squares  or  maximum  likelihood  estimators  to  be  accurate.  In 
this  case,  to  prevent  irregularity  of  the  estimated  function,  it 
is  useful  to  constrain  the  parameters  so  that  the  resulting 
network  function  is  smooth  or  to  penalize  the  performance 
criterion by incorporating a term for the lack of smoothness
(e.g.  the  sums  of  squares  of  first  partial  derivatives  of  the 
network  functions  at  the  observations).  Of  course  the  criteria 
mentioned  in  section  5  above  are  not  adequate  when  the 
dimension  of  the  network  is  fixed  in  advance. 

Adaptive  networks:  In  this  approach,  the  attempt  is  to 
estimate  networks  of  the  right  size  with  a  structure  evolved 
during  the  estimation  process  to  provide  a  parsimonious  model 
for the particular desired function. Typically, the network is
estimated one layer at a time, with the elements on each
given layer selected to minimize the predicted squared error or
complexity regularization criterion. The basic idea is that
once the elements on a lower level are estimated, and the
corresponding intermediate outputs $z$ are computed, then the
parameters in a given element $g(z, \theta)$ may be estimated by
usual least squares or likelihood maximization techniques. It
is most common for the elements on each layer to be greedily
trained to attempt to best estimate the desired final output,
even though the outputs of these elements are combined on
succeeding layers. On the other hand, some methods
developed in statistics select the element functions so as to
work best in linear combination with the previously selected
elements on a given layer.

Practical  experience  shows  clear  advantages  of  the 
adaptively  synthesized  networks  over  some  of  the  globally 
optimized  fixed  network  structures.  (However,  certain 
theoretically  appropriate  fixed  structures  have  yet  to  be  tried 
in  practice;  also,  the  smoothness  penalty  criteria  have  yet  to 
be  utilized  with  the  larger  fixed  networks.)  In  most  instances 
the  adaptively  synthesized  networks  are  more  parsimonious. 
Parts  of  the  network  which  are  inappropriate  or  extraneous 
for  statistically  modeling  the  given  data  are  automatically 
not  included  in  the  final  network.  The  drawback  of  the 
adaptive  strategies  is  that  they  cannot  be  guaranteed  to  work. 
It  is  possible  to  find  counterexamples  of  data  corresponding  to 
functions  which  are  exactly  modeled  by  a  two-layer  network, 
but  no  non-trivial  first  layer  elements  are  selected  by  a  given 
adaptive  synthesis  strategy. 

Mixed  adaptive/global  strategies:  After  the  best  elements  on 
each layer are computed, a numeric search can be used to
update the estimates of parameters for ancestral nodes on
earlier layers. An iterative scheme that alternates between
estimation of the parameters of the given element and the
estimation of the parameters of the ancestral nodes is
suggested by the projection pursuit algorithm and its
generalizations (see sections 10 and 12).

7.  Some  Early  Network  Developments 

While  linear  models  for  regression  and  thresholded 
linear models for classification (e.g. of the form (1), (2), or (3))
have been long used in statistical practice (with the
beginnings of the modern understanding due in large part to
R.A.  Fisher  (1922,  1934,  1936)  who  introduced  measures  of 
statistical  efficiency,  explained  the  efficiency  of  maximum 
likelihood  estimation,  and  derived  the  linear  discriminant 
function  for  multivariate  Gaussian  classification),  these  same 
linear  models  were  reintroduced  (unfortunately  with 
comparatively  inefficient  estimators)  in  the  1950's  and  1960’s 
as  a  basic  ingredient  in  learning  network  models.  The  new  and 
interesting  twist  was  that  more  general  classes  of  functions 
were  modeled  by  combining  these  simpler  models  into  a 
network.  Here  we  mention  some  of  the  development  which 
occurred in this period.

The  forerunners  in  the  network  modeling  field  were 
McCulloch  and  Pitts  (1943),  who  introduced  the  thresholded 
linear  function  as  a  model  for  the  behavior  of  a  neuron  and,  in 
that  paper,  analyzed  the  model  not  so  much  for  its  biological 
viability,  which  was  discussed  only  briefly,  but  rather  (in  the 
language  of  theoretical  computer  science)  as  a  basic 
computational  unit  with  the  property  that  any  predicate 
with  finite  domain  could  be  implemented  by  a  network  of  such 
units. 

There  was  a  surge  of  interest  in  methods  for  the  inference 
of  networks  (Hebb  1949,  Ashby  1952,  Farley  and  Clark  1954, 
Minsky  1954,  von  Neumann  1956,  Rosenblatt  1957,  Lee  and 
Gilstrap 1960) culminating in some interesting and successful
multiple layer estimation methods in the early 1960's due to
Rosenblatt (1962), Widrow et al. (1960, 1962, see also 1987),
and R.L. Barron et al. (1964, see Moddes et al. 1965, Gilstrap
1971, Barron et al. 1984). Although some of the networks due
to Rosenblatt and Barron et al. used more general elemental
functions than the original thresholded linear function, they
did share the form (3) (transformed variables were combined
linearly using free parameters). These heuristic multi-layer
methods were not well understood theoretically and (with the
exception of Rosenblatt's book) they were not widely
disseminated at that time. We emphasize that contrary to
the popularly held current belief (initiated in the book by
Minsky and Papert 1969 and perpetuated by statements as in
Rumelhart  et  al.  1986,  p.321),  powerful  rules  were  found  for 
the  estimation  of  multiple  layer  networks. 

The  methods  of  Widrow  et  al.  and  Rosenblatt  for  binary 
classification  possessed  many  similarities.  In  particular,  both 
authors  exclusively  utilized  recursive  estimation  strategies  in 
which  the  parameter  estimates  are  updated  with  each  new 
observation  by  an  error  correction  procedure  analogous  to  the 
Robbins-Monro (1951) stochastic approximation (but without
the  full  statistical  efficiency  known  to  hold  for  recursive  least 
squares  or  recursive  implementations  of  maximum  likelihood). 
Moreover,  both  approaches  were  amenable  to  clear 
theoretical  proofs  of  convergence  properties  in  the  case  of 
single  element  networks  (these  results  are  well-explained  in 
Nilsson  (1965)  and  Duda  and  Hart  (1973)).  Widrow  used  a 
stochastic  gradient  method  which  he  called  the  least  mean 
squares  (LMS)  algorithm.  Rosenblatt  used  a  method  (related 
to  relaxation  procedures  for  solving  linear  inequalities,  Agmon 
1954),  which  he  called  the  perceptron  algorithm:  it  finds  a 
hyperplane  which  perfectly  separates  the  two  classes 
whenever the classes are linearly separable. The non-convergent
behavior in the non-separable case was analyzed
by  Efron  (1964). 
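The two recursive rules for a single element can be sketched side by side. The fragment below is a textbook-style rendering under stated assumptions (labels coded 0/1, a bias weight, a fixed step size `mu` for LMS, and a small hypothetical data set); it is not a transcription of either author's original procedure.

```python
def linear_output(w, x):
    """w[0] is the bias weight; the remaining weights multiply the inputs."""
    return w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))

def threshold_output(w, x):
    return 1 if linear_output(w, x) > 0 else 0

def perceptron_fit(points, labels, max_epochs=100):
    """Error-correction rule: the weights change only when the output is wrong."""
    w = [0.0] * (len(points[0]) + 1)
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in zip(points, labels):
            err = y - threshold_output(w, x)  # -1, 0, or +1
            if err != 0:
                mistakes += 1
                w[0] += err
                for i, xi in enumerate(x):
                    w[i + 1] += err * xi
        if mistakes == 0:  # every training point is classified correctly
            break
    return w

def lms_fit(points, targets, mu=0.01, epochs=200):
    """Stochastic gradient rule on the squared error of the linear output."""
    w = [0.0] * (len(points[0]) + 1)
    for _ in range(epochs):
        for x, y in zip(points, targets):
            err = y - linear_output(w, x)
            w[0] += mu * err
            for i, xi in enumerate(x):
                w[i + 1] += mu * err * xi
    return w

def mean_squared_error(w, points, targets):
    n = len(points)
    return sum((y - linear_output(w, x)) ** 2 for x, y in zip(points, targets)) / n

# Hypothetical linearly separable data.
points = [(2.0, 1.0), (1.0, 3.0), (3.0, 0.5), (-1.0, -2.0), (-2.0, -1.0), (-0.5, -3.0)]
labels = [1, 1, 1, 0, 0, 0]
w_perceptron = perceptron_fit(points, labels)
w_lms = lms_fit(points, labels)
```

The perceptron rule stops changing once the classes are separated, whereas the LMS rule keeps adjusting the weights toward the least squares solution for the linear output.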

For  multiple  layer  networks  the  method  of  Widrow  et  al. 
(1960,  1962)  was  only  explained  in  the  case  that  first  layer 
elements  are  adjustable  and  the  succeeding  layers  are 
preselected.  Widrow  used  iterations  of  his  strategy  to  handle 
also the more general estimation problem, but this approach
was  not  published  until  Widrow  1987,  to  which  we  refer  the 
reader  for  a  description. 

For  two  and  three  layer  networks  of  thresholded  linear 
elements,  Rosenblatt  (1962,  ch.  13)  developed  an  algorithm 
which  he  called  back-propagating  error  correction 
(unfortunately,  this  name  recently  has  been  reused  for  another 
algorithm  for  network  estimation,  as  mentioned  below).  The 
objective  of  his  method  is  recursively  to  estimate  desired 
outputs  for  every  element  as  well  as  to  estimate  the 
parameters.  Naturally,  given  a  desired  output  of  an  element 
Rosenblatt  updates  the  parameter  estimates  in  the  element  by 
his  perceptron  algorithm  (here  a  parameter  update  occurs  only 
if  the  actual  output  differs  from  the  desired  output).  On  the 
other  hand,  if  the  output  of  an  element  does  match  the  desired 
value,  then  depending  on  whether  the  resulting  final  output  of 
the  network  is  in  error,  the  desired  intermediate  variable  is 
adjusted  to  reduce  this  error  (again  as  in  the  perceptron 
algorithm  but  with  the  role  of  parameters  and  variables 
reversed).  (Randomization  is  used  to  avoid  certain 
degeneracies.  In  particular,  with  each  step  no  update  action  is 
taken with probability $0 < p < 1$.) Rosenblatt advocated cycling
through  the  data  and  the  elements  of  the  network  in  such  a 
way  that  each  combination  (of  datum  and  network  element) 
potentially  would  be  considered  infinitely  often.  He 
presented  a  theorem  (Rosenblatt,  p.  294)  to  the  effect  that  if 
the  data  are  separable  by  the  network  (i.e.  there  exist 
parameter  values  for  which  the  network  function  correctly 
classifies  every  point),  then  his  estimation  strategy  will  find 
such  an  error-free  solution  in  a  finite  number  of  steps  (with 
probability  one). 

The approach developed by R.L. Barron et al. (1964) and
further explained in Moddes et al. (1965), Gilstrap (1971), and
Barron et al. (1984) solved the multilayer network estimation
problem by global search to minimize the sum of squared errors
$\sum_{i=1}^{n} (Y_i - f(X_i, \theta))^2$. Barron et al. introduced an algorithm called
guided accelerated random search (GARS) which alternated
between global random search (using a spherical normal
distribution centered at the current best point) and local
gradient search (for which convergence was accelerated by a
halving/doubling algorithm for the step size and by adjusting
a variable subset of the parameters at the different steps).
The particular elemental functions originally used by R.L.
Barron et al. were quadratic functions in two variables
$g(z, \theta) = \theta_0 + \theta_1 z_1 + \theta_2 z_2 + \theta_3 z_1 z_2$. A spirally-connected
network with 24 input variables and seven layers was
constructed (see fig. 2). Using 25-50 observations of simulated
reentry vehicle positions during a given time frame
$(t, t - \Delta t, \ldots, t - 7\Delta t)$, networks were constructed to predict the
final position and impact time of the vehicle. The parameters
of the networks were constrained to values in the interval
between -1 and +1. The GARS search routine converged to
essentially the same extremum of performance for each of
many randomly selected initial parameter vectors, suggesting
that a non-unique global optimum was reached. Performance
on an independent test set of observations suggested that
despite the complexity of the network, and the small sample
size, the estimated function was not overfit to training data.
(However, overfit problems were later experienced with these
large fixed networks on some industrial process modeling
problems; these experiences led in the early 1970s to the
adoption of adaptive synthesis strategies discussed below.)
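The alternation of global random search and local gradient search can be illustrated as follows. This is only a sketch in the spirit of the description above, not the GARS routine itself: the tiny two-element network, the numerical gradient, the acceptance rule, the fixed spread of the random phase, and all names are assumptions, and the variable-subset adjustment mentioned in the text is omitted.

```python
import random

def element(z1, z2, t):
    """Quadratic two-input element t0 + t1*z1 + t2*z2 + t3*z1*z2."""
    return t[0] + t[1] * z1 + t[2] * z2 + t[3] * z1 * z2

def network(x, theta):
    """Tiny fixed network: the first element feeds the second."""
    z = element(x[0], x[1], theta[0:4])
    return element(z, x[2], theta[4:8])

def sse(theta, data):
    return sum((y - network(x, theta)) ** 2 for x, y in data)

def clip(v):
    """Keep each parameter in the interval [-1, +1]."""
    return max(-1.0, min(1.0, v))

def numeric_gradient(loss, theta, h=1e-6):
    grad = []
    for i in range(len(theta)):
        up = theta[:]
        dn = theta[:]
        up[i] += h
        dn[i] -= h
        grad.append((loss(up) - loss(dn)) / (2.0 * h))
    return grad

def random_and_gradient_search(loss, theta0, rounds=30, trials=20,
                               spread=0.3, seed=0):
    rng = random.Random(seed)
    best, best_loss = theta0[:], loss(theta0)
    step = 0.01
    for _ in range(rounds):
        # Global phase: spherical normal perturbations of the current best point.
        for _ in range(trials):
            cand = [clip(t + rng.gauss(0.0, spread)) for t in best]
            c = loss(cand)
            if c < best_loss:
                best, best_loss = cand, c
        # Local phase: gradient steps; double the step size after a
        # success and halve it after a failure.
        for _ in range(trials):
            g = numeric_gradient(loss, best)
            cand = [clip(t - step * gi) for t, gi in zip(best, g)]
            c = loss(cand)
            if c < best_loss:
                best, best_loss = cand, c
                step *= 2.0
            else:
                step *= 0.5
    return best, best_loss

# Hypothetical data generated by a function this network can represent exactly.
data_rng = random.Random(1)
points = [[data_rng.uniform(-1, 1) for _ in range(3)] for _ in range(20)]
data = [(x, 0.5 * x[0] * x[1] * x[2] + 0.3 * x[2]) for x in points]
theta0 = [0.1] * 8
initial_loss = sse(theta0, data)
theta_hat, final_loss = random_and_gradient_search(lambda t: sse(t, data), theta0)
```

Since a candidate is accepted only when it lowers the loss, the returned value never exceeds the starting value; how close it comes to the global minimum depends on the surface and on the search settings.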


Fig.  2.  Uniform  Spiral  72-Element  Network 

The  network  of  fig.  2,  which  consists  of  quadratic  two- 
input  elements,  represents  a  family  of  sixth-degree 
polynomials.  Since  the  network  contains  a  total  of  288 
parameters,  this  family  is  a  relatively  low-dimensional 
manifold  in  the  complete  (593,775  dimensional!)  family  of 
sixth-degree  polynomials  in  24  variables.  Nevertheless,  the 
network  had  more  than  enough  flexibility  to  yield  accurate 
approximations  for  the  specific  application  to  re-entry 
vehicle  trajectory  predictions. 
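The two dimension counts quoted above can be checked directly. The count of four coefficients per element is inferred here from 288/72 and from the two-input quadratic element form; the variable names are illustrative.

```python
from math import comb

elements = 72
params_per_element = 4  # coefficients of 1, z1, z2, and z1*z2
network_params = elements * params_per_element

inputs, degree = 24, 6
# Monomials of total degree at most 6 in 24 variables: C(24 + 6, 6).
full_polynomial_params = comb(inputs + degree, degree)

print(network_params, full_polynomial_params)  # 288 593775
```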

8.  The  Current  Fashion 

In recent work Rumelhart, Hinton, and Williams (in
Rumelhart et al. 1986, ch. 8) propose that an implementation
of the gradient descent algorithm be used to attempt to
minimize the sum of squared error for multiple layer
feedforward networks. They use element functions of the form
(1) with h equal to a logistic function; this choice is viewed as
a smoothing of the step function to obtain a differentiable
function of the parameters. Since the network is a composition
of functions, the derivatives required for the gradient method
are determined by the chain rule of calculus (starting at the
final node and propagating back to the parameters in the first
layer). Although it is recognized that the gradient method
may be inappropriate in general for highly multi-modal
surfaces, Rumelhart et al. found that it worked adequately on
the simple examples that they considered. Hinton and
Sejnowski (in Rumelhart et al. 1986, ch. 7) propose that a
sequential random search algorithm (simulated annealing) be
used to estimate the parameters of a Hopfield style network;
they call their learning network a Boltzmann machine. These
papers (see Rumelhart et al. p. 321) give the impression that
multilayer search strategies for networks are novel to the
1980s. Clearly this is false in view of the methods we have
discussed. In our experience (beginning in the 1960s) a
combination of random and derivative-based search
strategies, as in the GARS algorithm, is an effective technique
for globally optimizing networks. In any event, much of the
recent work (as in Rumelhart et al.) has ignored the
developments in the 1970s and 1980s of the adaptive network
strategies and the nonparametric statistical methodologies
for specific network structures.
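The chain-rule computation of the derivatives described in this section can be sketched for a very small network of logistic elements (two inputs, two first-layer elements, one final element). The network size, the parameter layout, the sample data, and the function names are assumptions chosen for illustration; the finite-difference routine is included only to check the analytic gradient.

```python
import math

def logistic(t):
    return 1.0 / (1.0 + math.exp(-t))

def forward(p, x):
    """2-2-1 network; p = [w00, w01, b0, w10, w11, b1, v0, v1, c]."""
    h0 = logistic(p[0] * x[0] + p[1] * x[1] + p[2])
    h1 = logistic(p[3] * x[0] + p[4] * x[1] + p[5])
    out = logistic(p[6] * h0 + p[7] * h1 + p[8])
    return h0, h1, out

def sse(p, data):
    return sum((y - forward(p, x)[2]) ** 2 for x, y in data)

def backprop_gradient(p, data):
    """Gradient of the sum of squared errors by the chain rule."""
    g = [0.0] * 9
    for x, y in data:
        h0, h1, out = forward(p, x)
        # Start at the final node: derivative with respect to its net input.
        d_out = -2.0 * (y - out) * out * (1.0 - out)
        g[6] += d_out * h0
        g[7] += d_out * h1
        g[8] += d_out
        # Propagate back to the net inputs of the first-layer elements.
        d_h0 = d_out * p[6] * h0 * (1.0 - h0)
        d_h1 = d_out * p[7] * h1 * (1.0 - h1)
        g[0] += d_h0 * x[0]
        g[1] += d_h0 * x[1]
        g[2] += d_h0
        g[3] += d_h1 * x[0]
        g[4] += d_h1 * x[1]
        g[5] += d_h1
    return g

def numeric_gradient(p, data, h=1e-5):
    grad = []
    for i in range(len(p)):
        up = p[:]
        dn = p[:]
        up[i] += h
        dn[i] -= h
        grad.append((sse(up, data) - sse(dn, data)) / (2.0 * h))
    return grad

# Hypothetical data (exclusive-or) and an arbitrary parameter vector.
data = [((0.0, 0.0), 0.0), ((0.0, 1.0), 1.0), ((1.0, 0.0), 1.0), ((1.0, 1.0), 0.0)]
params = [0.5, -0.4, 0.1, -0.3, 0.8, -0.2, 0.7, -0.6, 0.05]
analytic = backprop_gradient(params, data)
numeric = numeric_gradient(params, data)
max_gap = max(abs(a - b) for a, b in zip(analytic, numeric))
```

A gradient descent step then moves each parameter a small distance against its component of `analytic`; as noted above, such steps carry no guarantee of reaching a global minimum on a multi-modal surface.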

9.  Networks  with  Adaptively  Synthesized  Structure 

With  the  propensity  of  large  fixed  networks  to  result  in 
overfit  estimates,  attention  was  turned  in  the  1970s  to 
networks  for  which  the  structure  is  adaptively  determined 
from the data. Such network strategies were introduced by
Ivakhnenko  (1971)  and  their  development  in  the  U.S.  is  traced 
in  Barron  et  al.  (1974, 1975,  1984, 1987). 

The  elements  extensively  utilized  in  these  adaptively 
synthesized networks are second- and third-order polynomial
functions in two variables. (One and three variable elements
are also used in recent implementations.) For the method to
work, the number of inputs of each element must be restricted
so  as  to  avoid  a  combinatorial  explosion  in  the  number  of 
possibilities  that  the  algorithm  must  check. 

In  brief,  the  basic  strategy  (using  elements  involving  two 
variables)  is  depicted  in  fig.  3.  On  the  first  layer,  all  possible 
pairs of the inputs are considered and the best $k_1$ are
temporarily saved. On the succeeding layers, all possible
pairs of the intermediate variables $z$ from the preceding
layer(s) are considered and the best $k_2$ ($k_3$, etc.) are saved.
Finally, when additional layers provide no more
improvement, the network synthesis stops. The final network
consists only of the ancestors of the final element.


[Figure 3: network diagram; the layer labels read "Pick the best k1" and "Pick the best k2".]

Fig.  3.  An  Adaptive  Network  Synthesis  Strategy 

In  the  original  Ivakhnenko  algorithm,  the  parameters 
within  each  element  were  estimated  so  as  to  minimize  on  a 
training  set  of  observations  the  sum  of  squared  errors  of  the  fit 
of the element to the final desired output. Cross-validation on
a separate testing set was used to rank and select the best
elements on each layer and to select the number of layers.
(Ivakhnenko called this division of the data into sets with
different purposes in network estimation the group method of
data  handling,  GMDH.)  The  need  to  construct  complete 
quadratic  polynomials  for  every  pair  of  variables  forced  early 
implementations  of  the  algorithm  to  restrict  the  number  k  of 
temporarily  saved  intermediate  variables  to  be  typically  not 
more  than  16. 
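One layer of the strategy just described can be sketched as follows: every pair of candidate inputs receives a complete quadratic element fitted by least squares on the training set, the elements are ranked by squared error on the separate testing set, and the best few are saved as inputs to the next layer. The code is an illustrative reading of the text, not Ivakhnenko's program; the small ridge term (added for numerical stability), the simulated data, and all names are assumptions.

```python
import itertools
import random

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            factor = M[r][c] / M[c][c]
            for j in range(c, n + 1):
                M[r][j] -= factor * M[c][j]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(M[r][j] * x[j] for j in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x

def features(u, v):
    """Terms of a complete quadratic polynomial in two variables."""
    return [1.0, u, v, u * u, v * v, u * v]

def fit_element(cols, pair, ys, ridge=1e-8):
    """Least squares coefficients for one pair of input columns."""
    rows = [features(cols[pair[0]][i], cols[pair[1]][i]) for i in range(len(ys))]
    A = [[sum(r[a] * r[b] for r in rows) + (ridge if a == b else 0.0)
          for b in range(6)] for a in range(6)]
    rhs = [sum(r[a] * y for r, y in zip(rows, ys)) for a in range(6)]
    return solve(A, rhs)

def element_output(coef, cols, pair):
    n = len(cols[0])
    return [sum(c * f for c, f in
                zip(coef, features(cols[pair[0]][i], cols[pair[1]][i])))
            for i in range(n)]

def gmdh_layer(train_cols, train_y, test_cols, test_y, keep):
    """Fit all pairs on the training set; rank them on the testing set."""
    scored = []
    for pair in itertools.combinations(range(len(train_cols)), 2):
        coef = fit_element(train_cols, pair, train_y)
        out = element_output(coef, test_cols, pair)
        err = sum((y - o) ** 2 for y, o in zip(test_y, out)) / len(test_y)
        scored.append((err, pair, coef))
    scored.sort(key=lambda s: s[0])
    saved = scored[:keep]
    new_train = [element_output(c, train_cols, p) for _, p, c in saved]
    new_test = [element_output(c, test_cols, p) for _, p, c in saved]
    return saved, new_train, new_test

# Hypothetical data: four inputs, output x0*x1 + x2*x3.
rng = random.Random(0)

def sample(n):
    cols = [[rng.uniform(-1, 1) for _ in range(n)] for _ in range(4)]
    y = [cols[0][i] * cols[1][i] + cols[2][i] * cols[3][i] for i in range(n)]
    return cols, y

train_cols, train_y = sample(40)
test_cols, test_y = sample(40)
layer1, z_train, z_test = gmdh_layer(train_cols, train_y, test_cols, test_y, keep=3)
layer2, _, _ = gmdh_layer(z_train, train_y, z_test, test_y, keep=1)
```

Each saved entry holds the testing-set error, the pair of inputs, and the fitted coefficients, so the ancestors of the final element can be traced back through the layers.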

Later algorithms developed by A.R. Barron (1979-1982,
Polynomial Network Training Routine, PNETTR III and IV,
Adaptronics,  Inc.)  incorporated  a  predicted  squared  error  PSE 
criterion  (related  to  the  criteria  of  Akaike  and  Mallows  as 
discussed  above)  at  every  phase  of  element  selection  in  the 
network.  Moreover,  a  method  was  developed  whereby 
candidate  pairs  are  prescreened  before  each  layer  (according 
to  their  predicted  error  in  linear  combination)  thereby 
permitting  more  elements  to  be  considered  on  each  layer 
(typically k is between 30 and 60). This also permitted more
complicated element calculations, i.e. third-degree
polynomials with subset selection by the PSE criterion. Also
the saved elements from all preceding layers are candidate
inputs  to  a  given  layer.  Moreover,  some  one-  and  three-input 
elements  are  considered  on  each  layer.  The  PNETTR 
algorithm  was  extensively  applied  to  problems  in 
nondestructive  evaluation  of  materials,  modeling  of  material 
characteristics,  flight  guidance  and  control,  target  recognition, 
intrusion  detection  systems,  and  scene  classification;  see 
Barron  et  al.  (1984)  and  the  references  cited  there.  For  an 
application  of  an  earlier  version  of  the  algorithm  to  weather 
forecasting  see  A.R.  Barron  et  al.  (1977). 

The more recently developed algorithm by J.F. Elder IV
(1985-present, Algorithm for Synthesis of Polynomial
Networks, ASPN, Barron Associates, Inc.) permits a choice of
a minimum complexity or predicted squared error criterion.
This algorithm has more user flexibility in the choice of one-,
two-, or three-input elements and in the form of the
polynomial elements (e.g. the degree may be adjusted within
certain limits). Moreover, at each layer a new element is
considered which is a linear combination of all elements on
the preceding layer.

Currently,  a  major  applications  thrust  is  use  of 
adaptively-synthesized  polynomial  networks  to  initialize 
and/or  re-initialize  (in  real  time)  two-point  boundary-value 
guidance  solutions  for  flight  vehicles  (R.L.  Barron  and  Abbott 
1988).  Polynomial  networks  are  trained  off-line  on  a  library  of
simulated  optimum  trajectories  and  interrogated  on-line  with 
information  about  existing  and  desired  vehicle  states. 
Interrogation  yields  numerical  values  of  six  initializing 
adjoint  variables  (Lagrange  multipliers)  in  a  calculus  of 
variations  formulation  of  the  trajectory  optimization  solution. 
Because  each  new  interrogation  answers  the  optimum-path- 
to-go  question,  a  guided  trajectory  need  not  be  restored,  when 
disturbed,  to  a  preconceived  nominal  path,  and  optimality  of
trajectory  energy  management  and  accuracy  of  guidance  are  not 
compromised  by  disturbances  within  maneuvering  limits  of  the 
vehicle.  In  the  two-point-boundary-value  guidance 
application,  the  role  of  the  polynomial  network  is  to  compress 
a  large  library  of  multivariate  trajectory  information  and 
render  it  in  a  form  (the  network)  suitable  for  virtually 
instantaneous  look-up  and  interpolation. 

Fig.  4  is  a  diagram  for  networks  trained  to  estimate  two  of 
the  initializing  adjoint  variables  for  a  specific  flight  vehicle 
guidance  application.  These  networks  were  synthesized  from 
a  data  base  of  435  observations  of  the  candidate  variables. 
Ten  variables  were  selected  by  ASPN  for  inclusion  in  the  final 
model.  The  information  presented  in  each  box  refers 
respectively  to  the  index  of  the  element  (in  the  list  of 
elements  saved  by  ASPN  during  synthesis),  the  type  of
element  (in  terms  of  number  of  inputs),  and  the  number  of  terms
in  each  cubic  expression  after  pruning  according  to  a  PSE
criterion.  The  "white"  element  computes  a  linear  combination
of  its  inputs.


Fig.  4.  An  Adaptively  Synthesized  Polynomial  Network

10.  Projection  Pursuit

The projection pursuit algorithm of Friedman et al.
(1974, 1981, 1984), which is so popular in statistical circles, has
not previously been discussed in the context of learning
networks. This algorithm adaptively synthesizes a three-layer
network in the form of fig. 5. The first-layer functions
implement linear combinations $\theta_k \cdot x$ for ordinary projection
pursuit (or $\sum_i h_{ki}(x_i)$ for a generalization of projection pursuit
to be discussed below). The second-layer functions $g_k(z)$ are
nonparametrically estimated functions of one variable.
Finally, the third layer simply takes a linear combination.
Thus the function implemented is

$$f(x) = \sum_k g_k(\theta_k \cdot x).$$

Fig.  5.  Network  Diagram  for  Projection  Pursuit 

The estimation strategy of projection pursuit proceeds
vertically through the levels indicated in fig. 5. On each
level, an iterative Gauss-Newton algorithm is employed
which alternates between estimation of the parameters $\theta_k$ from
the first layer and the function $g_k$ from the second layer so
that in linear combination with the preceding levels the fit
is optimized (using the sum of squared errors or a likelihood
criterion). Here the use of the optimized linear combination
of the $g_k$ is a relaxation method suggested by Lee Jones (1986) as
an improvement over the original method (which estimates $g_k$
to fit the error $y - (g_1 + \cdots + g_{k-1})$).


To  estimate  the  functions  g(z),  Friedman  et  al.  utilize  a 
nonparametric  smoothing  technique  involving  locally  linear 
functions  (the  linear  fit  at  an  arbitrary  point  z  is  estimated 
using  the  data  in  a  neighborhood  of  that  point). 
Nevertheless,  the  methodology  also  works  with  other
one-dimensional  nonparametric  estimation  techniques  such  as
smoothing  splines  or  variable  degree  polynomials.
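
A locally linear smoother of the kind described above can be sketched in a few lines. This is only an illustration under simplifying assumptions: the fixed-width, unweighted neighborhood and the names `local_linear_fit` and `bandwidth` are choices made here, not details of the smoother used by Friedman et al.

```python
def local_linear_fit(zs, ys, z0, bandwidth):
    """Value at z0 of a least-squares line fitted to the points within bandwidth of z0."""
    pts = [(z, y) for z, y in zip(zs, ys) if abs(z - z0) <= bandwidth]
    if len(pts) < 2:
        raise ValueError("need at least two points in the neighborhood")
    n = len(pts)
    mean_z = sum(z for z, _ in pts) / n
    mean_y = sum(y for _, y in pts) / n
    szz = sum((z - mean_z) ** 2 for z, _ in pts)
    if szz == 0.0:
        return mean_y  # all neighbors share one z value; fall back to the mean
    szy = sum((z - mean_z) * (y - mean_y) for z, y in pts)
    slope = szy / szz
    return mean_y + slope * (z0 - mean_z)


zs = [i / 20 for i in range(21)]
line = [2.0 * z + 1.0 for z in zs]   # exactly linear data
curve = [z * z for z in zs]          # smooth nonlinear data
fit_on_line = local_linear_fit(zs, line, 0.33, bandwidth=0.2)
fit_on_curve = local_linear_fit(zs, curve, 0.5, bandwidth=0.17)
```

On exactly linear data the smoother reproduces the line; on curved data a wider neighborhood increases the bias, which is why the neighborhood size has to shrink as the sample grows.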

Projection  pursuit  provides  an  excellent  example  of  a 
learning  network  with  both  parametrically  and 
nonparametrically  estimated  elements.  Also,  it  demonstrates 
an  effective  iterative  strategy  for  estimating  the  elements  of  a 
layer  of  a  network  to  work  well  in  combination  with  each 
other  rather  than  in  isolation. 

An  advantage  of  projection  pursuit  networks  is  that  they 
have  been  amenable  to  theoretical  examination  of  some  of 
their  approximation  properties  (Huber  1985,  Donoho  and 
Johnstone  1985,  Jones  1987),  although  much  work  remains  to  be 
done  in  this  direction.  In  particular  it  is  known  that  any 
square  integrable  function  can  be  approximated  by  a 
theoretical  analog  of  projection  pursuit,  provided  sufficiently 
many  (vertical)  levels  of  the  network  are  utilized;  however, 
the  analogous  result  for  data-driven  estimation  has  yet  to  be 
established. 

11.  Additive  Models  and  Transformations 

Additive models represent functions of the form $\sum_i g_i(x_i)$,
where in general the one-dimensional functions $g_i$ are
unconstrained and in practice usually are estimated
nonparametrically. (In contrast, linear models estimate only the
coefficients of linear combinations of fixed functions.) The
theory for the estimation of additive functions is developed in
Stone (1985). In particular, Stone demonstrates the surprising
result that, unlike general functions of d variables, additive
functions can be estimated with a convergence rate for the
expected squared error which is as good as the rate which can
be obtained for the estimation of one-dimensional functions
($n^{-2r/(2r+1)}$ instead of $n^{-2r/(2r+d)}$, where n is the sample size, r is
the assumed order of smoothness, and d is the dimension; see
section 14 below). Moreover, Stone showed that although not
every function is additive, a best additive approximation to a
function exists and can be estimated at the indicated rate.
Stone's approach to estimating the additive functions is to use
finite dimensional linear spaces of functions (such as splines,
polynomials, or truncated trigonometric series; in particular
Stone uses splines), so that the resulting additive
approximation is then written in terms of a linear function of
many fixed basis functions, in which case traditional least
squares projection becomes applicable.

Winsberg and Ramsay (1980) and Tibshirani (1988)
generalize additive approximation by permitting monotone
transformations $h(y)$ of the dependent variable. By inverting
this transformation, an approximation to the dependent
variable is obtained in the form depicted in fig. 6 with $g = h^{-1}$.
A related model is in Breiman and Friedman (1985) where
noninvertible transformations h are permitted.


Fig.  6.  Network  for  Transformations  of  Additive  Models 

Networks as in fig. 6 can be estimated by alternating
between estimates of the transformation g and the first layer
functions $g_k$ using methods similar to projection pursuit. In
particular, suppose finite series approximations are used for
each of the functions $g_k$. Given a current estimate of g (which
is assumed to be a differentiable function), a Gauss-Newton
type algorithm can be used for the estimation of the
coefficients in a finite series approximation of the $g_k$. Then,
given the current $g_k$, the new estimate of g can be obtained by
any of several nonparametric methods (e.g. least squares
projection onto a linear space of approximating functions, local
linear smoothing, etc.). These steps are then iterated until
only negligible improvement in the optimization criterion is
observed.

Our  purpose  for  mentioning  additive  models  in  the  context 
of  networks  is  that  this  structure  is  the  one  which  is  best 
understood  theoretically  (except  perhaps  for  linear 
discriminant  functions  and  linear  regressions  which  have  even
less  approximation  capabilities)  and,  moreover,  the  additive 
structure  is  a  basic  building  block  for  more  elaborate  networks 
which  show  some  promise.  Although  additive  models  cannot 
represent  interactions  between  variables,  interactions  can  be 
obtained  by  taking  sums  of  transformations  of  additive  models 
as  seen  below. 

12.  Generalizations 

It  appears  to  us  that  certain  extensions  to  the  network 
forms  of  projection  pursuit  or  transformations  of  additive 
functions  lead  naturally  to  a  particular  network  structure 
which  is  known  to  have  powerful  approximation  capabilities. 
The  statistical  estimation  strategies  associated  with 
projection  pursuit  and  additive  models  then  lead  to  estimation 
strategies  for  these  more  complex  network  forms. 

In  particular,  consider  networks  of  the  form  given  in  fig.  7. 
This  form  may  be  regarded  as  a  projection  pursuit  network, 
generalized  to  allow  transformations  of  the  original  variables 
on  the  first  layer.  Using  series  approximations  (e.g. 
polynomials)  for  these  transformations,  the  projection  pursuit 
estimation  algorithm  becomes  applicable  to  this  network  as 
discussed  in  section  10.  Alternatively,  the  network  of  fig.  7 
may  be  thought  of  as  a  composition  of  additive  functions. 
Specifically, the network consists of 2d+1 additive functions
with outputs $z_1, z_2, \ldots, z_{2d+1}$, say, which become the inputs to a
final additive function with output f. Whereas none of the
lower layer additive portions of the network can approximate
every function, the composition of these functions can
approximate any continuous function as discussed in section 13
below. In principle, any of the methods for estimating
transformations of additive models can be used to estimate the
k-th such function by fitting the model to the error resulting
from the sum of the previous k-1 models. However, such
iterative approximations may require more than the 2d+1
levels indicated by the theory.

A  specific  implementation  of  a  generalized  projection 
pursuit  algorithm  which  incorporates  some  of  the  features 
mentioned  above  is  being  developed  by  A.R.  Barron  and  Gayle 
Nygaard.  It  will  permit  the  use  of  polynomial,  spline,  or 
trigonometric  series  approximations  for  any  of  the 
transformations  of  the  network.  A  new  feature  of  this 
algorithm is that, when estimating $g_k$ in fig. 7, the
transformations $g_1, \ldots, g_{k-1}$ are backfitted to provide the
best additive combination by projecting to sums of basis
functions in the manner of Stone (1985). Moreover, after each
transformation is estimated, a backward stepwise rule (using a
penalized squared error or complexity criterion) is used to
prune unnecessary terms from each element. In view of the
relatively large (but fixed) size of the network structure, this
pruning of the number of coefficients is essential to avoid
overfit with moderate sample sizes. The most important
generalization is to permit nonparametrically estimated
transformations of the variables so as to achieve "projections"
to surfaces more general than the hyperplanes utilized in
traditional projection pursuit. It is then expected that fewer
projections are required (perhaps as few as 2d+1).

13.  Mathematical  Foundations 

Consider continuous functions $f(x_1, \ldots, x_d)$ of d variables on
a bounded set such as the unit cube $[0,1]^d$. Upon reflection it
appears that all familiar functions of three or more variables
are built up from the composition of various functions of one or
two variables. (For instance a sum of d variables is a
composition of d-1 bivariate sums.) Accustomed to the traps of
mathematical analysis, one might speculate that there exist
truly d-dimensional functions that cannot be represented in
this way. On the contrary, Kolmogorov (1957), see also
Lorentz (1966), proved the surprising result that every
continuous function on $[0,1]^d$ can be exactly represented as a
composition of sums and continuous one-dimensional functions.

Lorentz (1966) identified a particular composition scheme
(depicted in fig. 7) which works for all functions of a given
dimension. For any continuous function f on $[0,1]^d$, there exist
continuous one-dimensional functions $g_j$ and $h_{jk}$ for
$j = 1, 2, \ldots, 2d+1$ and $k = 1, 2, \ldots, d$ such that

$$f(x_1, \ldots, x_d) = \sum_{j=1}^{2d+1} g_j\Big(\sum_{k=1}^{d} h_{jk}(x_k)\Big).$$

Moreover, Lorentz demonstrated the existence of universal
functions $h_{jk}$ which do not depend on the function f (whereas
the $g_j$ do depend on f). In his proof, Lorentz constructs
piecewise linear functions $g_j^{(\epsilon)}$ with the property that for
every x in the cube the majority (i.e. at least d+1) of the
values $g_j^{(\epsilon)}(\sum_k h_{jk}(x_k))$ (for $j = 1, \ldots, 2d+1$) are within $\epsilon$ of f(x).
(This proof suggests that it might be more natural to use the
median of $g_j(\sum_k h_{jk}(x_k))$ instead of the sum
to approximate f.) The proof of the existence of an exact
representation involves a careful limiting argument with
$\epsilon \to 0$.


Fig.  7.  Kolmogorov-Lorentz  Network


In general the functions $g_j$ for which the representation is
valid may be rather irregular (e.g. nondifferentiable). It is
reasonable to expect that for sufficiently regular functions f,
relatively smooth elements $g_j$ and $h_{jk}$ can be used in the
representation, especially if the $h_{jk}$ are allowed to depend on f.

One way to quantify the smoothness of a function is the
characteristic s. A function of d variables has characteristic
s = p/d, where $p = r + \alpha$ if all derivatives of order r are
Lipschitz continuous of order $\alpha$ where $0 < \alpha \le 1$ (this is the case
with $\alpha = 1$, $r = p - 1$ if the derivatives of order p are bounded).
(This smoothness characteristic is used by Stone (1982) to
obtain minimax rates of convergence of nonparametric
estimators, see below.) Kolmogorov (1959), see also Lorentz
(1966), proved that not every function with a given smoothness
characteristic can be represented as a composition of functions
having a larger smoothness characteristic. This means, for
instance, that there exist functions of ten variables which are
differentiable up to order ten that cannot be represented by
compositions using one-dimensional functions having more
than one derivative.

The  limitations  expressed  by  these  theoretical  results  do 
not  preclude  the  possibility  that  many  of  the  practically 
occurring  functions  which  one  might  wish  to  estimate  are 
representable  in  terms  of  low-dimensional  functions  of  large 
smoothness  characteristic.  For  instance,  it  might  be  true  that
infinitely  differentiable  functions  can  be  represented  in  terms
of  compositions  of  infinitely  differentiable  functions  of  low
dimensionality. 

The  appeal  of  the  Kolmogorov-Lorentz  representation 
compared  to  other  familiar  network  structures  is  the  economy 
of  network  nodes.  A  fixed  number  of  one-dimensional 
continuous  functions  (namely  (d+1)(2d+1))  suffices  to  give  an
approximation  or  even  an  exact  representation. 
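
The structure of the representation, and the node count just quoted, can be illustrated with a short sketch. The functions passed in below are placeholders chosen only to exercise the code; they are not the universal inner functions or the function-dependent outer functions whose existence the theorem asserts, and the helper names are assumptions introduced here. The median variant follows the remark on Lorentz's proof in the preceding paragraphs.

```python
import statistics


def kl_network_outputs(x, g, h):
    """Outputs g[j](sum_k h[j][k](x[k])) of the 2d+1 second-layer elements."""
    d = len(x)
    if len(g) != 2 * d + 1 or len(h) != len(g) or any(len(row) != d for row in h):
        raise ValueError("expected 2d+1 outer functions and (2d+1) x d inner functions")
    return [g_j(sum(h_jk(x_k) for h_jk, x_k in zip(h_j, x)))
            for g_j, h_j in zip(g, h)]


def kl_network_sum(x, g, h):
    """Network output using the sum over the 2d+1 elements."""
    return sum(kl_network_outputs(x, g, h))


def kl_network_median(x, g, h):
    """Alternative output using the median of the 2d+1 elements."""
    return statistics.median(kl_network_outputs(x, g, h))


def node_count(d):
    """2d+1 outer functions plus d(2d+1) inner functions."""
    return (d + 1) * (2 * d + 1)


d = 2
identity = lambda t: t  # placeholder one-dimensional function
g = [identity] * (2 * d + 1)
h = [[identity] * d for _ in range(2 * d + 1)]
value = kl_network_sum([0.25, 0.5], g, h)
middle = kl_network_median([0.25, 0.5], g, h)
```

For d = 2 the structure uses 15 one-dimensional functions, and for d = 10 it uses 231, regardless of the function being represented.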

Other  network  structures  are  known  to  possess 
approximation  capabilities,  but  generally  the  number  of 
network  nodes  depends  on  the  function  being  approximated  and 
the desired accuracy. Subsequent to our Interface presentation,
George Cybenko informed us of some of his recent results
(Cybenko 1988). Consider three-layer networks in which the
element in the final layer takes a linear combination of its
inputs and the first two layers are restricted to elements in the
form of equation (1), each of which uses the same nonlinear
transformation h. This function h is permitted to be any fixed
continuous strictly increasing function with bounded range.
Cybenko proved that for any continuous function f on a
d-dimensional cube and any $\epsilon > 0$, there exists a three-layer
network with elements of the form (1) that approximates f
with error uniformly less than $\epsilon$. His proof is to show that the
first two layers of the network may be used to implement
kernel functions ("approximations to the identity") of
appropriate bandwidths having arbitrary centers, from
which the result follows by taking an appropriate linear
combination. Cybenko also points out that two-layer networks
are sufficient if quadratic functions are used in first layer
elements of the form (3), for then certain kernel functions may
be constructed by taking linear combinations of these elements.
Although Cybenko does not refer to the rich collection of
statistical  literature  on  kernel  approximation  (see  the  books 
by  Prakasa  Rao  1983,  Devroye  1987,  or  Eubanks  1988),  it  is 
apparent  that  results  in  this  area  could  be  utilized  to  bound 
the  number  of  kernels  (and  hence  the  number  of  nodes  in 
Cybenko's  networks)  required  to  achieve  a  given  accuracy. 

Some  basic  results  in  mathematical  analysis  which  have 
impact  on  the  approximation  capabilities  of  network  forms 
should not be overlooked. The Weierstrass theorem and its
generalization to multivariate functions asserts that any
continuous function on $[0,1]^d$ can be uniformly approximated by
a sufficiently large degree polynomial. The polynomial
approximations need not be restricted to the canonical sum of
products form $\sum_k \theta_k x_1^{k_1} \cdots x_d^{k_d}$ (which is itself a large network of
simple structure). Indeed, the multivariate generalization of
Weierstrass's theorem is seen to be an immediate corollary to
the Kolmogorov-Lorentz representation theorem.
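
The univariate Weierstrass theorem can be illustrated numerically with Bernstein polynomials, one standard constructive route to the theorem. This technique is brought in here purely for illustration and is not discussed in the text above; the function names and the test function are assumptions made for the sketch.

```python
import math


def bernstein(f, degree, x):
    """Degree-N Bernstein polynomial of f evaluated at x in [0, 1]."""
    return sum(
        f(k / degree) * math.comb(degree, k) * x ** k * (1 - x) ** (degree - k)
        for k in range(degree + 1)
    )


def max_error(f, degree, grid=201):
    """Largest absolute error on an evenly spaced grid over [0, 1]."""
    pts = [i / (grid - 1) for i in range(grid)]
    return max(abs(f(x) - bernstein(f, degree, x)) for x in pts)


kink = lambda x: abs(x - 0.5)  # continuous but not differentiable at 0.5
errors = [max_error(kink, n) for n in (4, 16, 64, 256)]
```

The uniform error shrinks as the degree grows, but slowly for this non-smooth function, which is the kind of dependence on smoothness that the Jackson theorems quantify.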

Other multivariate forms are known to approximate
arbitrary continuous functions. For instance, finite
trigonometric sums $\sum_k (\alpha_k \cos(\pi k \cdot x) + \beta_k \sin(\pi k \cdot x))$ can
uniformly approximate any continuous function on $[0,1]^d$,
provided the function is continuously extended to satisfy
boundary conditions on $[-1,1]^d$ (see Lorentz 1966, p.87). Here
$k = (k_1, \ldots, k_d)$ and $k \cdot x = \sum_i k_i x_i$. We remark that the sin and
cos  functions  have  bounded  variation,  so  they  can  be 
represented  as  the  difference  of  monotone  functions  h. 
Consequently,  the  trigonometric  sum  is  a  two-layer  network 
with  first  layer  elements  having  the  form  (1).  This  gives  a 
simple  proof  of  Cybenko's  theorem  specialized  to  such  h. 


The Jackson theorems express bounds on the accuracy of a
polynomial or trigonometric approximation in terms of the
assumed smoothness of the function being approximated. (See
Jackson 1930 for a lucid treatment of the univariate case and
Lorentz 1966, especially pp. 87-90, for multivariate
extensions.) For instance, if a function f has partial
derivatives $\partial^r f / \partial x_i^r$ of order $r \ge 0$ which are Lipschitz of order
$0 < \alpha \le 1$, then there is a constant c such that for every $N \ge 1$ a
polynomial approximation of degree N (in each coordinate)
exists with error uniformly less than $c N^{-p}$, where $p = r + \alpha$.
Unfortunately,  Jackson  type  theorems  are  not  known  for 
polynomial  approximations  which  take  a  network  form  other 
than  a  sum  of  products. 

14.  Some  Limitations  on  the  Statistical  Accuracy  of  Learning 
Networks 

In  practice,  learning  network  approximations  are  not 
obtained  from  completely  known  functions,  but  rather  they  are 
estimated from a training sample of observations of relevant
variables. The sample is typically a sequence of input/output
pairs $(X_1, Y_1), \ldots, (X_n, Y_n)$ which is assumed to possess one of
several possible probabilistic structures as discussed
previously. There is a fundamental question which is
addressed for this class of problems: What is the relationship
between the achievable accuracy and the size n of the sample?
Typically it is found that the answer depends on the class of
possible functions. Especially critical are the dimension d and
the regularity of the function. Results from approximation
theory play a key role in these statistical considerations. The
presently known answers, which we discuss below, are
somewhat discouraging, especially with regard to practical
constraints imposed on the dimensionality. To understand
better  and  to  avoid  the  pitfalls  of  high  dimensionality,  it  is 
suggested  that  new  approximation  theory  and  estimation 
results  are  needed  for  specific  network  composition  strategies. 

Stone  (1982)  has  fundamental  results  concerning  a  class  of 
nonparametric  estimation  problems  which  includes  curve  or 
surface  fitting  with  normally  distributed  errors  and  binary 
classification  with  unknown  conditional  class  probability 
functions.  Attention  is  restricted  to  functions  on  a  bounded  set 
with  a  given  smoothness  characteristic  s  =  p/d  (in  the  sense 
that  all  cross  partial  derivatives  of  total  order  r  are  Lipshitz 
of  order  a  and  p  =  r  +  a  as  above).  Stone  establishes  that  the 
optimal rate of convergence is $\epsilon_n = n^{-p/(2p+d)}$ for the $L^q$ norms
($0 < q < \infty$) and $\epsilon_n = (n^{-1} \log n)^{p/(2p+d)}$ for the $L^\infty$ norm. This means
that there exist estimators $\hat f_n$ (depending only on the sample)
such that the ratio $\|\hat f_n - f\| / \epsilon_n$ is bounded in probability for all
functions f of the given smoothness class. Conversely, for any
sequence of estimators $\hat f_n$ there exist sequences of functions f of
the given smoothness class for which the ratio $\|\hat f_n - f\| / \epsilon_n$ is
bounded away from zero in probability, as $n \to \infty$. To achieve
the optimal rate of convergence, Stone (1982) uses local
polynomial regression. The value of the estimator $\hat f_n(x)$ at a
point x is obtained by a weighted least squares polynomial fit
using all data points for which the distance from x is less than
$\delta_n$. Stone chooses the sequence $\delta_n$ to converge to zero at rate
$n^{-1/(2p+d)}$ and chooses the local polynomials to have total
degree r.

For convergence of the mean integrated squared error
(MISE) uniformly over all functions which have a bound on
the $L^2$ norm of derivatives of order p, the optimal convergence
rate is of the form $n^{-2p/(2p+d)}$. Indeed, a consequence of Stone's
result is that this asymptotic rate cannot be improved. This
rate is achieved in regression contexts by multivariate
smoothing splines (Cox 1984) and in some cases by least squares
polynomial regression and trigonometric series regression, see
Cox (1988). A. R. Barron (1988) has analogous results for the
estimation of a log-density function. For the special case d=1,
asymptotic (and in some cases exact) minimax estimators are
found in Efroimovich and Pinsker (1983) for density
estimation, and Nussbaum (1985) and Speckman (1985) for
regression. In these univariate cases the constant c(p) is
determined in the asymptotic minimax error $c(p)\, n^{-2p/(2p+1)}$.
For d>1, it appears that the corresponding constant c(p,d) for
exact asymptotics $c(p,d)\, n^{-2p/(2p+d)}$ is not yet explicitly
determined. Determination of the behavior of this constant
for large d would be useful, since it would help determine
whether practical minimax estimation is possible in high
dimensions.

Observe that unless the degree of smoothness p is large
compared to the dimension d, the optimum rate of convergence
$n^{-p/(2p+d)}$ is disappointingly slow. For instance, with
dimension d = 8 and smoothness p = 2, a sample of size $n \ge 10^6$
(one million!) would be required to make $n^{-p/(2p+d)}$ be not
greater than 1/10.
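
The arithmetic behind this example can be checked directly. The sketch below ignores the constant factors in the rate, exactly as the example in the text does; the function names are introduced here for illustration.

```python
import math


def minimax_rate(n, p, d):
    """Rate n^(-p/(2p+d)) for smoothness p and dimension d (constants ignored)."""
    return n ** (-p / (2 * p + d))


def sample_size_for(tolerance, p, d):
    """Smallest real n with n^(-p/(2p+d)) <= tolerance."""
    return tolerance ** (-(2 * p + d) / p)


n_high_dim = sample_size_for(0.1, p=2, d=8)  # exponent (2p+d)/p = 6
n_one_dim = sample_size_for(0.1, p=2, d=1)   # exponent (2p+d)/p = 2.5
```

With p = 2 the required sample size grows from a few hundred when d = 1 to about one million when d = 8, because the exponent (2p+d)/p grows linearly in the dimension.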

The  slow  rates  for  optimal  estimation  of  smooth  functions
in  high  dimensions  suggest  that  to  understand  the  practical 
success  of  certain  high-dimensional  estimation  strategies  it 
may  be  necessary  to  use  notions  of  the  regularity  of  a  function 
other  than  differentiability  to  quantify  the  limits  on
statistical  accuracy.  One  possibility  is  to  assume  proximity  of 
the  desired  function  to  functions  of  low  Kolmogorov 
complexity.  It  may  then  be  possible  to  obtain  rate  of 
convergence  results  as  well  as  the  consistency  results  referred 
to  in  section  5  (for  networks  selected  by  complexity 
regularization).  This  is  a  topic  of  further  investigation. 

In recent work by Baum and Haussler, the
Vapnik-Chervonenkis dimension of families of network functions is
characterized and used to quantify the statistical reliability
of estimated networks for binary classification. Using results
of Cover (1965, 1967) on the number of possible dichotomies of
a sample by networks of thresholded linear elements, Baum
(1988) has bounded the Vapnik-Chervonenkis dimension in
terms of the total number of coefficients in the network. Let
$0 < \epsilon_1 < \epsilon_2 < 1$ be given. Suppose it is observed that the fraction of
errors of an estimated network is less than $\epsilon_1$ on a training
sample of size n. Then it is of interest to bound the conditional
probability that a fraction of at least $\epsilon_2$ errors will be incurred
by  this  network  on  an  independent  test  sample.  Baum  and 
Haussler  (1988)  have  some  results  in  this  direction,  assuming 
that  the  total  number  of  coefficients  is  sufficiently  small 
compared  to  the  sample  size. 
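
For a single thresholded linear element, the count of dichotomies is given by Cover's function-counting formula, sketched below. The restriction to one element with points in general position is an assumption of this illustration; the bounds of Baum and of Baum and Haussler for whole networks are not reproduced here.

```python
import math


def cover_dichotomies(n, d):
    """Number of dichotomies of n points in general position in R^d that a
    homogeneous linear threshold element can realize: 2 * sum_{k<d} C(n-1, k)."""
    if n < 1 or d < 1:
        raise ValueError("n and d must be positive")
    return 2 * sum(math.comb(n - 1, k) for k in range(d))


# A threshold element with a bias term on points in the plane has d = 3 coefficients.
four_points_in_plane = cover_dichotomies(4, 3)
fraction_at_capacity = cover_dichotomies(20, 10) / 2 ** 20
```

Four points in the plane admit 14 of the 16 possible dichotomies (the two exclusive-or labelings are excluded), and at n = 2d exactly half of all dichotomies are realizable.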

The  advantage  of  the  Baum  and  Haussler  approach  is  its 
usefulness  in  retrospective  analysis:  i.e.,  given  that  an
accurate  estimate  has  been  found  on  training  data,  what  is  the 
probability  of  error  likely  to  be  on  new  data?  This  approach 
avoids  questions  concerning  the  approximation  capabilities  of 
a  network;  in  particular,  the  probability  that  an  estimated 
network  will  achieve  a  certain  accuracy  is  not  determined. 

15.  Conclusions 

Historically,  neural  networks,  adaptive  polynomial 
learning,  and  nonparametric  statistical  inference  are  fields  of 
inquiry  with  distinct  perspectives  and  separate  lines  of 
development  which  have  crossed  paths  only  on  occasion. 
However,  by  examining  the  purpose,  scope,  and 
methodologies  in  these  fields,  considerable  commonality  is 
revealed.  In  each  case,  network  functions  are  used  to 
approximate  possibly  complex  multivariate  relationships  by 
composition  of  many  simpler  relationships.  Moreover, 
strategies for the synthesis of these networks from observable
data are developed. To understand the performance of these
strategies and to suggest improved methodologies, practical
experience is supplemented by an understanding of the basic
disciplines of mathematical approximation theory and
statistical decision theory. Conversely, it behooves the
practitioner in multivariate nonparametric statistical
inference to become aware of the benefits and experiences in
the  use  of  multiple-layered  networks  for  classification, 
regression,  and  related  problems. 

In our experience the most successful learning network
methodologies adaptively grow the network structure, using
all the observational data (in batch rather than recursively)
and using an appropriate model selection criterion to ensure a
parsimonious network. Moreover, the best strategies employ
network structures which are not limited in their
approximational capabilities. The principal examples of
these successful methodologies are adaptively synthesized
polynomial networks and projection pursuit.

It  appears  to  us  that  several  different  approaches  lead
inevitably  to  one  network  structure  and  similar  synthesis
strategies;  namely  the  network  of  fig.  7  (introduced  by
Kolmogorov  and  Lorentz)  estimated  by  a  generalization  of 
projection  pursuit  which  incorporates  additive  projections  or 
estimated  by  polynomial  network  strategies  specialized  to 
this  structure.  This  network  considerably  extends  the 
capabilities  of  existing  projection  pursuit  and  additive 
regression  models,  yet  retains  enough  of  the  regularity  of  these 
models  that  it  may  be  amenable  to  further  theoretical  and 
practical  examinations  of  its  properties.  Nevertheless,  we 
should  not  restrict  all  attention  to  just  one  network  structure. 
Hopefully,  by  consideration  of  a  variety  of  different 
compositions,  empirically  selecting  the  best  (say  by 
complexity  regularization),  discovery  of  the  true 
relationships  can  occur. 


Appendix:  Convergence  of  networks  estimated  by  complexity
regularization 

In this appendix we specialize some results from A.R.
Barron (1985, 1987) to show convergence of estimates of network
functions. In general the theory is concerned with the
selection of a probability distribution using random data
$W^n = (W_1, W_2, \ldots, W_n)$. It is assumed that $\Gamma$ is a countable
collection of probability distributions which are candidates for the
estimate of the distribution of the process $W_1, W_2, \ldots$ and that
$L(P)$, $P \in \Gamma$, are positive numbers which satisfy the
Kraft-McMillan inequality $\sum_{P \in \Gamma} 2^{-L(P)} \le 1$. (Here L(P) may be
regarded as the length of a uniquely decodable code, or $2^{-L(P)}$
may be regarded as a discrete prior probability.) Short lengths
L(P) are desired for as large as possible a set of distributions
that can be computed, so ideally, we would let L(P) be the
Kolmogorov complexity (relative to a fixed universal computer)
and $\Gamma$ would be the set of all computable distributions;
however, the determination of such an ideal complexity is
practically infeasible. Nevertheless, the complexity principle provides
a useful guide in selecting reasonable sets of distributions and
assigning priors geared toward parsimonious distributions.
When the distribution is known except for a function f of d
variables on which the distribution $P_f$ depends, then families of
network functions and corresponding description lengths can be
used to yield an effective criterion for selecting an appropriate
network.

In general the complexity regularization estimator $\hat{P}_n$ is
defined to achieve

$\min_{P \in \Gamma} \{ -\log p(W_1, \ldots, W_n) + L(P) \}$.    (9)

Here the density functions p are taken with respect to a fixed
dominating measure. Logarithms are taken base 2. When
$W_1, \ldots, W_n$ are discretized random variables, then
$-\log p(W_1, \ldots, W_n)$ (upon rounding up to the nearest integer) is
the length of a Shannon code for these variables based on the
distribution P and the term $L(P)$ is the length of a preamble
required to specify which distribution. A more general form of
complexity regularization is to minimize

$CR = -\log p(W^n) + \lambda L(P)$    (10)

where $\lambda$ may be regarded as a Lagrange multiplier. Unless
$\lambda = 1$, CR does not have the same total description length
interpretation. Nevertheless, the solutions $\hat{P}_n$ which minimize
CR for $\lambda > 0$ do have the valid interpretation as maximum
likelihood estimators subject to complexity constraints. Such
estimators were first proposed by Cover (1972). Our convergence
results require that $\lambda \ge 1$ be fixed, although in one case
$\lambda > 1$ is required.
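As a rough illustration of criterion (10), the sketch below picks a Bernoulli distribution for binary data from a small candidate list. The candidate parameters, their code lengths, and the function names are hypothetical choices made for this example (they are not taken from Barron's papers); the lengths are chosen so that the Kraft-McMillan sum does not exceed one.

```python
import math

# Hypothetical candidate set: Bernoulli parameter -> code length L(P) in bits.
# Shorter codes are given to "simpler" parameters; the Kraft sum is 0.875.
CANDIDATES = {0.5: 1, 0.25: 3, 0.75: 3, 0.125: 4, 0.875: 4}


def kraft_sum(candidates):
    """Sum of 2^{-L(P)} over the candidate set (must not exceed 1)."""
    return sum(2.0 ** -length for length in candidates.values())


def complexity_regularized_choice(data, candidates, lam=1.0):
    """Return (theta, CR) minimizing -log2 p(data) + lam * L(P), as in (10)."""
    ones = sum(data)
    zeros = len(data) - ones
    best_theta, best_cr = None, math.inf
    for theta, length in candidates.items():
        # Negative base-2 log-likelihood of i.i.d. Bernoulli(theta) data.
        neg_log_lik = -(ones * math.log2(theta) + zeros * math.log2(1.0 - theta))
        cr = neg_log_lik + lam * length
        if cr < best_cr:
            best_theta, best_cr = theta, cr
    return best_theta, best_cr


biased_sample = [1] * 75 + [0] * 25
theta_hat, cr_value = complexity_regularized_choice(biased_sample, CANDIDATES)
```

With `lam = 1` the returned value can be read as a total two-part description length in bits: a preamble naming the distribution followed by a Shannon code for the data.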

We mention several general convergence results. First
suppose that the distributions P in $\Gamma$ are stationary and ergodic.
Let $P^*$ denote the true probability law which governs the
process. The first result is that if $P^* \in \Gamma$ then the estimated
distribution is exactly correct, $\hat{P}_n = P^*$, for all large n, with
probability one. For the remaining results suppose that the
variables $W_i$ are independent and identically distributed with
respect to $P^*$, and likewise that independence holds for the
distributions in $\Gamma$, whence $p(W_1, \ldots, W_n) = \prod_{i=1}^{n} p(W_i)$.
Moreover, it is assumed that the true density function $p^*$ can be
approximated by densities in $\Gamma$ in an information theoretic
sense: that is, there exist densities in $\Gamma$ for which the
relative entropy $\int p^* \log (p^*/p)$ is arbitrarily small. This leads
to the second result that $\hat{P}_n \Rightarrow P^*$ (in the sense of weak
convergence) with probability one; moreover, if the densities in
$\Gamma$ are uniformly equicontinuous then $\hat{p}_n \to p^*$ in [...]. Since
the uniform equicontinuity is not easy to guarantee in general,
we mention a third result which makes no such requirement. If
$\lambda > 1$ and if densities in $\Gamma$ can approximate $p^*$ in the relative
entropy sense, then $\hat{p}_n \to p^*$ in $L^1$, that is
$\lim \int |\hat{p}_n - p^*| = 0$, with probability one. The second and
third results continue to be valid (with convergence in
probability statements replacing convergence with probability one)
when the set $\Gamma_n$ and the numbers $L_n(P)$ are allowed to depend
on the sample size n, provided that there exists a sequence of
densities $p_n$ in $\Gamma_n$ for which $\lim \int p^* \log (p^*/p_n) = 0$ and
$\lim L_n(P_n)/n = 0$.

For the estimation of network functions we take
$W_i = (X_i, Y_i)$, which is assumed to have a distribution $P_f$ which
depends on the function f we desire to estimate. A denumerable
(possibly finite) collection $S_n$ of parameterized families of
network functions $f(x, \theta)$ is considered. We assume that the
sequence of collections is increasing, i.e. $S_1 \subset S_2 \subset \cdots$, and that
$L(f)$, $f \in S$ are lengths of codes which specify the structure,
but not the parameter values, of networks in $S = \bigcup_n S_n$. For
each network family f, the parameter vector (which has dimension
denoted by $k_f$) is assumed for convenience to take values
in the unit cube $[0,1)^{k_f}$. (Families with larger rectangular
parameter spaces can be reduced to this case by scaling and
appropriately modifying the definition of f.) We restrict
attention to the lattice of points with coordinates of the form
$i/\sqrt{n}$ for integers $0 \le i < \sqrt{n}$ and we use (1/2) log n bits per
parameter to describe these points.

For each parametrized network $f(x, \theta)$ in $S_n$, let $\hat{\theta}_n$ be
estimated by the method of maximum likelihood restricted to
the parameter values of the given precision. Thus $\hat{\theta}_n$ achieves

$p(W^n \mid f(\cdot, \hat{\theta}_n)) = \max_{\theta} p(W^n \mid f(\cdot, \theta))$.    (11)

The complexity regularization estimator is the network $\hat{f}_n$
defined to achieve

$\min_{f \in S_n} [ -\log p(W^n \mid f(\cdot, \hat{\theta}_n)) + \lambda \frac{k_f}{2} \log n + \lambda L(f) ]$.    (12)

We remark that other precisions than (1/2) log n bits could be
used in the definition, provided the maximum likelihood
estimator is suitably restricted. (For smooth families, a second
order Taylor series argument shows that the present choice
achieves roughly the best tradeoff between complexity and
likelihood. In some cases an improved tradeoff is obtained
using local reparametrizations as dictated by the Fisher
information matrix, as in A.R. Barron (1985, p. 74).) With $\lambda = 1$, the
specialization of the complexity regularization criterion given in
(12) is very much the same as Rissanen's MDL criterion.
However, the $L(f)$ term (omitted by Rissanen) can be important,
especially when there is a large variety of families under
consideration.

As a special case of interest consider function fitting
problems with Gaussian errors. In this case, for given X, the
conditional distribution of the error $Y - f(X)$ is normal with mean
zero and variance $\sigma^2$. The $X_i$ are assumed to be randomly
selected, independently, from a distribution which does not
depend on f. Then the complexity regularization criterion
reduces to

$CR = \frac{\log e}{2\sigma^2} \sum_{i=1}^{n} (Y_i - f(X_i, \theta))^2 + \lambda \frac{k_f}{2} \log n + \lambda L(f)$.    (13)

Let $f^*$ be the true function which we desire to estimate.
Assuming that the networks in S are continuous functions of
their parameters, the information theoretic closure condition
reduces (in the Gaussian case) to the condition that
$\inf E (f^*(X) - f(X, \theta))^2 = 0$, i.e. the true function must be
approximable in the $L^2$ sense by members of network families
under consideration. In which case, networks $\hat{f}_n(X)$ which are
selected to minimize (13) (with $\lambda > 1$) are guaranteed to
converge to $f^*(X)$ in probability.
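A minimal sketch of how a criterion of the form (13) might be used for model selection follows. It is only an illustration under stated assumptions: the "network families" are piecewise-constant functions on [0, 1) with k bins, the code length L(f) is simply the position of the family in a short list plus one bit, the parameters are fitted by unrestricted least squares rather than on the lattice of precision (1/2) log n, and the data are simulated. None of these choices comes from the paper.

```python
import math
import random


def fit_histogram(xs, ys, k):
    """Least-squares (Gaussian maximum likelihood) fit of a k-bin step function."""
    sums, counts = [0.0] * k, [0] * k
    for x, y in zip(xs, ys):
        b = min(int(x * k), k - 1)
        sums[b] += y
        counts[b] += 1
    means = [s / c if c else 0.0 for s, c in zip(sums, counts)]
    rss = sum((y - means[min(int(x * k), k - 1)]) ** 2 for x, y in zip(xs, ys))
    return means, rss


def complexity_criterion(rss, k, n, sigma, code_len, lam):
    """Criterion (13) in bits: scaled residual sum of squares plus penalties."""
    fit_term = (math.log2(math.e) / (2.0 * sigma ** 2)) * rss
    return fit_term + lam * (k / 2.0) * math.log2(n) + lam * code_len


def select_network(xs, ys, sigma, lam=1.1, bin_counts=(1, 2, 4, 8, 16)):
    """Return the bin count minimizing the criterion, with all scores."""
    n = len(xs)
    scores = {}
    for index, k in enumerate(bin_counts):
        _, rss = fit_histogram(xs, ys, k)
        scores[k] = complexity_criterion(rss, k, n, sigma, index + 1, lam)
    return min(scores, key=scores.get), scores


# Simulated data: a four-level step function observed with Gaussian noise.
rng = random.Random(0)
LEVELS = [0.0, 1.0, 3.0, 2.0]
SIGMA = 0.5
xs = [rng.random() for _ in range(400)]
ys = [LEVELS[int(4 * x)] + rng.gauss(0.0, SIGMA) for x in xs]
best_k, scores = select_network(xs, ys, SIGMA)
```

Families that are too coarse pay through the residual term, while families that are too fine pay through the (k/2) log n and L(f) terms, which is the tradeoff the criterion is meant to balance.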
ABSTRACT

Recent progress in modelling connectionist
("neural") networks gives rise to
the expectation that future computing
systems will employ coprocessors in which
large numbers of memoryless, nonlinear
processing units interact through plastic
connections.  Hopfield  has  drawn  atten¬ 
tion  to  symmetrically  interconnected  net¬ 
works  of  binary  threshold  units.  These 
collective  computation  networks  converge 
rapidly  to  stable  states  corresponding  to 
local  minima  of  the  computational  energy. 
The  network  can  be  freed  from  local  min¬ 
ima  by  the  addition  of  noise  at  the  input 
of each neuron-like unit. The state then
takes a random walk on the $2^N$ vertices of
a hypercube, where N is the number of
"neurons". This paper uses a simple, explicit
algorithm to study the behavior of
collective  computation  networks  with  ad¬ 
ditive  noise.  The  algorithm  gives  rise 
to  a  stationary  Boltzmann  distribution  of 
the  network  state.  Formulas  for  the  tem¬ 
peratures  of  non-logistic  noises  are 
derived  and  tested  in  Monte  Carlo  trials. 

INTRODUCTION 

This  concerns  one  of  the  folk  theorems 
of  statistical  neurodynamics,  which  holds 
that  the  states  of  a  globally  asymptoti¬ 
cally  stable  neural  network,  subjected  to 
isothermal  agitation,  occur  with  relative 
frequencies  given  by  the  Boltzmann  dis¬ 
tribution.  Global  asymptotic  stability 
follows  from  the  existence  of  an  energy 
function H of the network's state s. This
was discovered by Hopfield [3], whose expression

$H(s) = -\sum_{i=1}^{N-1} \sum_{j=i+1}^{N} S_i S_j T_{ij} - \sum_{i=1}^{N} S_i U_i$    (1)

for the computational energy of the network
is analogous to the Hamiltonian of a
collection of interacting magnetic dipoles.
Here $S_i$ is the state of the i-th
neuron-like element, either firing at its
peak rate ($S_i = 1$) or resting ($S_i = 0$); and
the connectivity matrix $\|T_{ij}\|$ gives the
strength of the "synapse" through which
the i-th unit excites (if $T_{ij} > 0$) or
inhibits (if $T_{ij} < 0$) the j-th unit. The
state of the i-th unit is decided by a
threshold test applied to its input,

$X_i = \sum_{j=1}^{N} S_j T_{ji} + U_i$.    (2.1)

The binary "McCulloch-Pitts neuron" obeys
the rule

$S_i = 1$ if $X_i > 0$,  $S_i = 0$ if $X_i < 0$.    (2.2)


When  the  T-matrix  is  real-valued  and  sym¬ 
metric,  with  all  zeros  on  the  diagonal, 
the  network  evolves  toward  stable  states 
which  correspond  to  local  minima  of  the 
computational  energy.  The  "energy  land¬ 
scape"  can  be  configured  so  that  these 
local  minima  correspond  to  solutions  of 
constrained  optimization  and  pattern 
recognition problems [7,8]. In the latter
case, the vector U of inputs to the N
units might represent the pixel pattern
on a retina.
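The noise-free dynamics described by (1), (2.1) and (2.2) can be sketched in a few lines. The weight matrix below is a hypothetical four-unit example; it is an assumption chosen to be consistent with the energy minima quoted later for the four-unit net, not a transcription of Figure 1. Units are swept in a fixed order here rather than interrogated at random, and a unit whose input is exactly zero is left unchanged, since (2.2) does not specify that case.

```python
# Hypothetical symmetric weights with a zero diagonal (an assumption for
# illustration; the labels of Figure 1 are not reproduced here).
T = [
    [0.0, 0.5, -0.5, -1.0],
    [0.5, 0.0, -0.5, -1.0],
    [-0.5, -0.5, 0.0, 2.0],
    [-1.0, -1.0, 2.0, 0.0],
]


def energy(s, u):
    """Computational energy (1): pair sum over i < j plus the input term."""
    n = len(s)
    pairs = sum(s[i] * s[j] * T[i][j] for i in range(n - 1) for j in range(i + 1, n))
    return -pairs - sum(si * ui for si, ui in zip(s, u))


def net_input(s, u, i):
    """Input (2.1) to unit i."""
    return sum(s[j] * T[j][i] for j in range(len(s))) + u[i]


def settle(s, u):
    """Apply the threshold rule (2.2) unit by unit until no unit changes."""
    s = list(s)
    changed = True
    while changed:
        changed = False
        for i in range(len(s)):
            x = net_input(s, u, i)
            new_value = 1 if x > 0 else 0 if x < 0 else s[i]
            if new_value != s[i]:
                before = energy(s, u)
                s[i] = new_value
                # Each accepted change lowers the energy by |x|.
                assert energy(s, u) < before
                changed = True
    return s


stable_state = settle([0, 0, 0, 0], [1, 1, 0, 0])
```

Because every accepted change strictly lowers the energy and there are finitely many states, the loop must stop at a state from which no single-unit change is downhill, that is, a local minimum.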


"BOLTZMANN  MACHINES" 

A provocative paper by Ackley, Hinton
and Sejnowski [1] proposed simulated annealing
to dislodge the Hopfield network
from local minima and enable it to settle
into states of still lower energy which
would represent better (if still
suboptimal) solutions. The network is
"heated" by the addition of noise to the
input of each unit. When these noises
are independent, identically distributed
random variables, the state s takes a
random walk on the $2^N$ vertices of a
hypercube. The stationary distribution is

$\Pr(S = s) = \exp[-\beta H(s)] / \sum_{s'} \exp[-\beta H(s')]$.    (3)

The assertion of Ackley, Hinton and
Sejnowski, that $1/\beta = T$ is the root mean
intensity of noise described by a logistic
distribution, was not powerfully
motivated. Shaw et al. [6] had earlier
arrived at an expression like (3) in
which $\beta$ is a "smearing factor" determined
from details of a stochastic model of the
chemical synapse.

It was over a hundred years ago that
Gibbs sought time-invariant solutions to
a Liouville equation in which the independent
variables were the Hamiltonian
coordinates of a multiparticle system and
the dependent variable was the probability
of the system being in a given state.
He arrived at a canonical ensemble in
which "the index of probability [i.e., the
log-probability] is a linear function of
the energy" of the state. This result is
expressed by equation (3), called a
Boltzmann distribution. Other functions
of the energy, however, will serve this
purpose; and the fact that the linear
dependence (of log-probability on energy)
maximizes the entropy of the system is
not necessarily germane to the question.
Belief in the possibility of a mathematical
treatment of biological intelligence,
patterned after statistical thermodynamics,
goes back at least as far as
the works of John von Neumann, published
posthumously. For this belief to find
expression in contemporary neural network
research is not surprising. The mathematician
who studies this work must be
slightly bewildered by derivations which
appeal to analogies with statistical
physics, some of which are complicated by
psychological theory [5]. The validity
of the Boltzmann distribution, in the
context of the connectionist paradigm, is
solely dependent on the existence of
models which give rise to it. As far as
real neuronal networks are concerned, the
laboratory experiments which would verify
the result have yet to be defined.


The  Algorithm 

The computational technique of simulated
annealing traces its roots to the
Metropolis [4] algorithm, which updates
the state of an N-particle system according
to a stochastic model in which the
Boltzmann distribution is expressly
assumed beforehand. An alternative
derivation due to C. R. Darnafalski, the
amateur mathematician whose unpublished
essays have been cited elsewhere [2],
involves the following stochastic model:
Pick an integer $i \in \{1, \ldots, N\}$ at random.
Compute $X_i$ according to (2.1). Modify $X_i$
by the addition of a real random variable,
call it $Y_k$, which is symmetrically
distributed about a mean of zero. Compute
$S_i$ according to (2.2). These steps
are iterated indefinitely with independent,
identically distributed random numbers
$\{Y_k, k = 1, 2, \ldots\}$. It is not hard to
see that this gives rise to a sequence
$\{s_k, k = 1, 2, \ldots\}$ of states which
constitute a Markov chain. Nonzero probabilities
are attributed to transitions
which involve at most one component of
the state vector. With no external input
(U = 0), these probabilities depend on
the $T_{ij}$ and the distribution of Y, as
described in the Appendix. When
Hopfield's conditions are obeyed by the
former, the stationary distribution can
be derived analytically. This distribution
is

$\Pr(S = s) = Z^{-1} \exp\{ \sum_{i=1}^{N-1} \sum_{j=i+1}^{N} S_i S_j \log( F(T_{ij}) / [1 - F(T_{ij})] ) \}$    (4)

in which F is the (cumulative) distribution
function of Y and the denominator Z
is the sum over all states which normalizes
the discrete density. The assumption
of logistic noise, as

$F(y) = 1/(1 + e^{-\beta y})$,  $-\infty < y < \infty$,    (5)

gives the last equation a particularly
simple form (3).
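The stochastic model above is short enough to simulate directly. The sketch below is a hedged illustration, not the author's program: the weight matrix is a hypothetical four-unit example, the state is numbered by reading the units as binary digits with unit 1 as the least significant bit, every iteration is counted (rather than sampling at random intervals), and the logistic noise of (5) is drawn by inverting F. The empirical state frequencies can then be compared with the Boltzmann distribution (3) for U = 0.

```python
import math
import random

# Hypothetical symmetric weights with a zero diagonal (assumption).
T = [
    [0.0, 0.5, -0.5, -1.0],
    [0.5, 0.0, -0.5, -1.0],
    [-0.5, -0.5, 0.0, 2.0],
    [-1.0, -1.0, 2.0, 0.0],
]


def logistic_noise(rng, beta):
    """Draw Y with distribution function (5) by inverting F."""
    u = rng.random()
    while u == 0.0:          # avoid log(0); random() may return exactly 0.0
        u = rng.random()
    return math.log(u / (1.0 - u)) / beta


def simulate(steps, beta, seed=0):
    """Run the noisy update rule with no input and return state frequencies."""
    rng = random.Random(seed)
    n = len(T)
    s = [0] * n
    counts = [0] * (2 ** n)
    for _ in range(steps):
        i = rng.randrange(n)                          # pick a unit at random
        x = sum(s[j] * T[j][i] for j in range(n))     # (2.1) with U = 0
        s[i] = 1 if x + logistic_noise(rng, beta) > 0 else 0   # (2.2) on X + Y
        counts[sum(bit << k for k, bit in enumerate(s))] += 1
    return [c / steps for c in counts]


def boltzmann(beta):
    """Distribution (3) over all states for U = 0."""
    n = len(T)
    weights = []
    for m in range(2 ** n):
        s = [(m >> k) & 1 for k in range(n)]
        h = -sum(s[i] * s[j] * T[i][j] for i in range(n - 1) for j in range(i + 1, n))
        weights.append(math.exp(-beta * h))
    z = sum(weights)
    return [w / z for w in weights]


observed = simulate(steps=200000, beta=1.0)
expected = boltzmann(beta=1.0)
```

Successive states of the chain are correlated, so agreement between `observed` and `expected` should be judged with a tolerance that is looser than the one for independent samples of the same size.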

Asymptotic  Temperature 

With regard to (4), suppose that the
root mean intensity of the noise is large
compared to each $T_{ij} = t$. Then the first
order Taylor series expansion of the
logarithm is

$\log\{F(t)/[1 - F(t)]\} \approx 4tF'(0)$

since F(0) = 1/2. Defining the asymptotic
temperature of the network in such a
manner that $\beta = 1/T_{as}$ in (3), we shall have

$T_{as} = 1/[4f(0)]$    (6)

in terms of the probability density f(y)
= F'(y). When (5) is assumed, the last
equation is indeed valid for $\beta = 1/T$. If
the noise were normally distributed with
standard deviation $\sigma$, then the asymptotic
temperature would be

$T_{NORMAL} = (\sigma/2)\sqrt{\pi/2}$.

If the noise had a Cauchy density $f(y) =
(1/\pi)\, c/(c^2 + y^2)$, the temperature would be

$T_{CAUCHY} = \pi c/4$.

Clearly this asymptotic temperature is
not a function of the mean noise intensity,
since the variance of the Cauchy
random variable is undefined.
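Formula (6) and its three special cases are easy to check numerically. The helper names below are hypothetical; the sketch also inverts the formulas to obtain the normal standard deviation or Cauchy scale that corresponds to a requested asymptotic temperature, and evaluates the exact log-odds for normal noise so that the first-order approximation can be compared against it.

```python
import math


def asymptotic_temperature(density_at_zero):
    """Equation (6): T_as = 1 / (4 f(0))."""
    return 1.0 / (4.0 * density_at_zero)


def logistic_density_at_zero(beta):
    """f(0) for the logistic distribution function (5)."""
    return beta / 4.0


def normal_density_at_zero(sigma):
    return 1.0 / (sigma * math.sqrt(2.0 * math.pi))


def cauchy_density_at_zero(c):
    return 1.0 / (math.pi * c)


def normal_sigma_for(temperature):
    """Standard deviation whose asymptotic temperature equals `temperature`."""
    return 4.0 * temperature / math.sqrt(2.0 * math.pi)


def cauchy_scale_for(temperature):
    """Cauchy scale c whose asymptotic temperature equals `temperature`."""
    return 4.0 * temperature / math.pi


def normal_log_odds(t, sigma):
    """Exact log{F(t) / [1 - F(t)]} for zero-mean normal noise."""
    cdf = 0.5 * (1.0 + math.erf(t / (sigma * math.sqrt(2.0))))
    return math.log(cdf / (1.0 - cdf))


t_normal = asymptotic_temperature(normal_density_at_zero(1.0))
t_cauchy = asymptotic_temperature(cauchy_density_at_zero(1.0))
```

For a connection strength t that is small relative to the noise scale, `normal_log_odds(t, sigma)` should be close to `t / t_normal`, which is the sense in which the temperature is "asymptotic".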

SIMULATIONS 

Figure 1 represents a Hopfield net of
four  units  in  which  the  labeled  segments 
give  the  dimensionless  strengths  of  the 
symmetric  interconnections.  Let  the  in¬ 
puts  to  units  1  and  2  be  denoted  A  and  B, 
respectively;  and  let  S^  =  C.  We  shall 
consider  only  binary  (0,1)  inputs.  The 
insets suggest that this small network
performs the NAND (Not-AND) logic function
C(A,B) which the truth table (right)
defines. This would indeed be the case
if the network always settled into the
state which gives the global or absolute
minimum energy. Table 1 uses the formula

$m(s) = \sum_{i=1}^{4} S_i 2^{i-1}$

to assign a natural number m to each of
the 16 states of the network; and it
lists (-1 times) the energies of the
states for each input condition $AB \in \{00,
01, 10, 11\}$. With input AB=11, the minimum
energy is -2.5 and it occurs in state m=3
for which C is zero. With the other inputs,
the minimum energy is -2.0 and occurs
in state m=12 for which C=1. This
motivates the truth table of Figure 1.
Figure 2 is a state transition map to
show which transitions are allowed.
Since units are interrogated in a random
serial order, only one unit can toggle at
a time. Thus the allowed transitions are
of Hamming distance one. The 16 states
of the "NAND gate" correspond to squares
in the 4x4 array of the map. The squares
are labeled with the values m(s). Motion
is horizontal or vertical, never diagonal,
between adjacent squares. The map
wraps around horizontally and vertically
as indicated by the connecting lines and
arrows.

Figure 1. A Hopfield net of four units,
two of which receive binary inputs (A and
B) and one of which registers the output
(C). The T-matrix is specified by the
labels on the line segments linking the
units. This net is designed so that, for
given inputs, the global minimum energy
state gives a functional dependence
C(A,B) as shown in the truth table (inset
upper right), which defines the Not-AND
(NAND) logic function.

The interaction map of Figure 3 consists
of four sub-maps each with the
structure of the preceding Figure. Here
each square is labeled with $-H_m(AB) =
-H(s[m], U[AB])$. The four sub-maps
correspond to the four input conditions. If
the network begins in state m=3 with
AB=11, the energy is minimized and the
state is stable. Now if the input
changes, the network is unable to leave
the initial state, because any allowed
transition will increase the energy.
Similarly, if the initial state is m=12,
and the input is subsequently set to
AB=11, the state cannot assume the
desired value (m=3) except by way of
intermediate states of higher energy.

When noise is injected into the units
of the network, the state can be dislodged
from local (and global) energy
minima. The Boltzmann distribution of
the network state is indeed observed in
Monte Carlo trials with the network of
Figure 1, to an accuracy consistent with
sample size. Figure 4 shows the results
of one such test in which 999 observations
of s were recorded at random intervals
in the course of ten thousand iterations
of the algorithm described above.
Here the input is AB=00 so that the modal
probability (i.e., the probability of the
most likely state) is $p_M = \Pr(m[s] = 12)$.
This test used logistic noise with temperature
T=1.

Figure 2. The state transition map for
the network of Figure 1 uses the indicated
binary-to-decimal convention to assign
an integer (0 through 15) to each
state of the net. Each square represents
a state. Allowed transitions, which are
of Hamming distance one, correspond to
vertical or horizontal motion from one
square to one of four adjacent squares.

Figure 3. The interaction (negative
energy) maps for each of the four input
conditions have the same format as Figure
2; but the squares are labeled with -1
times the computational energies. Arrows
emphasize the entrapment of the four unit
"NAND gate" in local energy minima.

Table 1. Interaction values of the sixteen
states of the four unit NAND gate
indicating global maxima (marked *) for
each input condition.

STATE   AB = 00      10      01      11
   0         0        0       0       0
   1         0        1       0       1
   2         0        0       1       1
   3      -0.5      1.5     1.5     2.5*
   4         0        0       0       0
   5      -0.5      0.5    -0.5     0.5
   6      -0.5     -0.5     0.5     0.5
   7      -0.5      0.5     0.5     1.5
   8         0        0       0       0
   9      -0.5        0      -1       0
  10      -0.5       -1       0       0
  11      -1.5     -0.5    -0.5     0.5
  12         2*       2*      2*      2
  13       0.5      1.5     0.5     1.5
  14       0.5      0.5     1.5     1.5
  15      -0.5      0.5     0.5     1.5

Figure 4. Theoretical and observed distributions
of the network state with
logistic noise at a temperature T=1.
Sample distribution is based on 999 observations.

Figure 5. Modal probability versus temperature
for the four unit "NAND gate"
with zero input using three kinds of
noise.

When the noise is not logistic, deviations
from the Boltzmann distribution are
apparent, especially at lower temperatures.
Figure 5 shows the variation of
the modal probability with temperature
for each of three noise distributions.

ANALYSIS  AND  CONCLUSION 

One measure of the disparity of two
discrete probability densities, p and q,
is the directed divergence, or (Kullback)
information for discrimination against p
in favor of q:

$I(q, p) = \sum_m q_m \log(q_m / p_m)$.

It is well known that, if q is a sample
distribution, obtained from J independent
observations of a random variable with
discrete density p, $p_m > 0$ for all $m \in
\{0, \ldots, M-1\}$, then the product JI(q,p) is
chi-square with M-1 degrees of freedom in
the limit $M/J \to 0$. Then the mean value
of the product JI is approximately M-1
for large J; and values of JI in obvious
excess of M-1 will tend to refute the
null hypothesis p.
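The statistic JI(q,p) can be computed from a vector of state counts as sketched below. The function names are hypothetical, the natural logarithm is assumed (the text does not state the base), and terms with $q_m = 0$ are taken to contribute zero.

```python
import math


def directed_divergence(q, p):
    """I(q, p) = sum_m q_m log(q_m / p_m), with 0 log 0 taken as 0."""
    return sum(qm * math.log(qm / pm) for qm, pm in zip(q, p) if qm > 0.0)


def divergence_statistic(counts, p):
    """Product J * I(q, p), where q is the sample distribution of the counts."""
    j = sum(counts)
    q = [c / j for c in counts]
    return j * directed_divergence(q, p)


# Example: 100 observations over four equally likely states.
uniform = [0.25, 0.25, 0.25, 0.25]
balanced = divergence_statistic([25, 25, 25, 25], uniform)
skewed = divergence_statistic([70, 10, 10, 10], uniform)
```

A sample that matches the hypothesized density exactly gives a statistic of zero, and the statistic grows as the sample distribution moves away from p.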

Table 2 shows the product JI of the
sample size and the discrimination information
with the Boltzmann distribution as
the null hypothesis. Each point represents
about a thousand observations of
the state of the four unit "NAND gate" at
random intervals in the course of runs of
length 10,000. Three different noise
distributions are considered with the input
AB=00 at each of five temperatures.
The expected value of the statistic is
M-1 = 15 if the null hypothesis pertains.
With logistic noise, the observations are
below this criterion value in every case.
With Cauchy or with normal noise, the
null hypothesis is clearly rejected at
T=1/2. The case AB=11 is considered in
the right-most column with normal noise; and
again the test statistic warrants rejection
at T=1/2. These results might be
summarized by saying that the asymptotic
temperatures, calculated above for
non-logistic noises, are reasonable approximations
when they equal or exceed
unit value.

Table 2. Divergence of the N-sample distribution
from the theoretical distribution
of the states of the four-unit
NAND gate.

          input = (0,0)                 input = (1,1)
  T       LOGIS    CAUCHY   NORMAL      NORMAL
 2.5       13.3     18.9      8.7        18.9
 2.0        9.5     15.6     18.6        21.8
 1.5       10.1      9.7     24.0        13.7
 1.0        6.8     11.1     29.2        20.9
 0.5        9.5    206.1     68.7        64.8

APPENDIX 

The purpose of this Appendix is to
derive the transition matrix of the
Markov chain $\{s_k, k = 1, 2, \ldots\}$. Let s and
$s^T$ be the network state as a column and
row vector, respectively. Let $d_j$ denote
a column vector which has N components
the i-th of which is $\delta_{ij}$ in terms of the
Kronecker delta. Consider just the case
of no input (U = 0). Then $X_j = s^T T d_j$
where T is the connectivity matrix subject
to Hopfield's restrictions. The algorithm
selects a j at random and computes
$S_j = 1[X_j + Y]$, where 1[.] is the
unit step and Y has d.f. F(y) and density
f(y), which is symmetric about y=0.

We want the probability of a transition
from state s to state s + ds, where
$ds = d_j = \mathrm{col}(\delta_{ij})$ if $S_j = 0$ and $ds = -d_j =
\mathrm{col}(-\delta_{ij})$ if $S_j = 1$. This probability,
denoted $Q(s + ds \mid s)$, is proportional to
1/N, the probability that j is selected,
and is given by

$Q(s + ds \mid s) = \begin{cases} (1/N) \Pr\{Y + X_j > 0\} & \text{if } ds = d_j \\ (1/N) \Pr\{Y + X_j < 0\} & \text{if } ds = -d_j \end{cases}$

in which $X_j$ is determined by s as noted.
These statements are the same as

$Q(s + ds \mid s) = \begin{cases} (1/N) F(s^T T d_j) & \text{if } ds = d_j \\ (1/N) [1 - F(s^T T d_j)] & \text{if } ds = -d_j \end{cases}$

because of the symmetry of the distribution
of Y. For transitions of zero Hamming
distance we shall have

$Q(s \mid s) = 1 - \sum_{ds} Q(s + ds \mid s)$.

For transitions of distance more than
one, the probability is zero, since the
algorithm specifies that the interrogation
of the units is one-at-a-time.
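The transition probabilities above can be assembled into an explicit matrix and checked against the Boltzmann distribution (3) for the logistic case. The sketch below assumes a hypothetical four-unit weight matrix, no input (U = 0), and a state numbering in which unit 1 is the least significant binary digit; the function names are illustrative only.

```python
import math

# Hypothetical symmetric weights with a zero diagonal (assumption).
T = [
    [0.0, 0.5, -0.5, -1.0],
    [0.5, 0.0, -0.5, -1.0],
    [-0.5, -0.5, 0.0, 2.0],
    [-1.0, -1.0, 2.0, 0.0],
]


def logistic_cdf(y, beta):
    """Distribution function (5)."""
    return 1.0 / (1.0 + math.exp(-beta * y))


def transition_matrix(cdf):
    """Matrix Q[m][m'] of one-step transition probabilities for U = 0."""
    n = len(T)
    size = 2 ** n
    q = [[0.0] * size for _ in range(size)]
    for m in range(size):
        s = [(m >> k) & 1 for k in range(n)]
        for j in range(n):
            x = sum(s[i] * T[i][j] for i in range(n))        # X_j = s^T T d_j
            flip = cdf(x) if s[j] == 0 else 1.0 - cdf(x)     # turn on / turn off
            q[m][m ^ (1 << j)] = flip / n
        q[m][m] = 1.0 - sum(q[m])                            # zero Hamming distance
    return q


def boltzmann(beta):
    """Distribution (3) over all states for U = 0."""
    n = len(T)
    weights = []
    for m in range(2 ** n):
        s = [(m >> k) & 1 for k in range(n)]
        h = -sum(s[i] * s[j] * T[i][j] for i in range(n - 1) for j in range(i + 1, n))
        weights.append(math.exp(-beta * h))
    z = sum(weights)
    return [w / z for w in weights]


def stationarity_error(pi, q):
    """Largest absolute difference between pi and pi Q."""
    size = len(pi)
    return max(abs(sum(pi[m] * q[m][k] for m in range(size)) - pi[k])
               for k in range(size))


Q = transition_matrix(lambda y: logistic_cdf(y, 1.0))
error = stationarity_error(boltzmann(1.0), Q)
```

Passing a different symmetric distribution function to `transition_matrix` gives the chain for non-logistic noise, whose stationary behavior can then be compared with (3) at the corresponding asymptotic temperature.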
Abstract.

Traditional  optimization  algorithms  are  not  easily  adap¬ 
ted  to  parallel  computers.  Even  though  the  linear  alge¬ 
bra  operations  can  be  programmed  in  parallel,  the  costs 
associated  with  evaluating  the  objective  function  often 
overwhelm  the  linear  algebra  costs,  and  so  some  paral¬ 
lelism  in  the  function  evaluations  is  essential.  This  pa¬ 
per  describes  how  such  parallelism  can  be  obtained  by 
using  the  block  Lanczos  algorithm  within  a  truncated- 
Newton  method.  This  algorithm  also  admits  parallelism 
in  the  linear  algebra  of  the  algorithm.  The  resulting  algo¬ 
rithms  are  suitable  for  coarse-grained  parallel  computers. 
Details  on  arithmetic  and  communication  costs  are  pro¬ 
vided. 

1.  Introduction. 

This paper describes an algorithm for solving

minimize $f(x)$    (1)

on a parallel computer. Here we assume that $f(x)$ is a
smooth nonlinear real-valued function of n variables x.

A method for solving this problem was given in [10].
It is based on a truncated-Newton method [2]: given some
initial guess $x_0$, at each iteration a search direction p is
computed by approximately solving the Newton equations
using a block Lanczos method; then a step is taken
along that direction so that the function value decreases
($x_{k+1} \leftarrow x_k + \alpha p$, where $f(x_{k+1}) < f(x_k)$).

As was shown in [10], such an approach can lead to a
successful parallel algorithm. On a number of test problems,
effective use of parallelism was made, both in the
linear algebra operations and in parallel function
evaluations. The purpose of this paper is to analyze more
carefully the algorithm used to compute the search direction,
the block Lanczos method. We give here detailed
information on the arithmetic and communication costs
of that algorithm. Related discussions can be found in
[13].

Here  is  an  outline  of  the  paper:  In  Section  2  we  give  a 
general  discussion  of  the  nonlinear  optimization  method. 
In  Section  3,  we  show  how  parallel  and  vector  computer 
hardware  can  be  used  within  the  block  Lanczos  method, 
and  list  its  costs.  Section  4  contains  our  conclusions. 

Other approaches to parallelism in optimization algorithms
are available; see, for example, [1] and [7].

2.  The  Optimization  Algorithm. 

The algorithm we have used to solve the problem (1) is a
descent method based on a line search. If $x_k$ is the current
approximation to a solution $x^*$, then we set $x_{k+1} \leftarrow x_k + \alpha p$,
where p is a local downhill (descent) direction for
$f(x)$ at $x_k$, and $\alpha > 0$. The scalar parameter $\alpha$ is chosen
so that $f(x_{k+1}) < f(x_k)$; techniques for computing $\alpha$ can
be found in [4]. Under mild assumptions (see [3]) this
algorithm can be shown to converge to a point where the
gradient of $f(x)$ is zero, i.e., the first-order conditions for
a minimum are satisfied. Our main interest here is the
computation of the direction p, since this is typically the
most expensive aspect of the optimization algorithm.

The classical approach to this problem is to use Newton's
method. If we expand $f(x)$ in a Taylor series about
$x_k$ we obtain

$f(x_k + p) = f(x_k) + p^T g_k + \frac{1}{2} p^T G_k p + O(\|p\|^3)$
$\approx f(x_k) + p^T g_k + \frac{1}{2} p^T G_k p$
$= f(x_k) + Q(p)$,

where $g_k = \nabla f(x_k)$ is the gradient of $f(x)$ at $x_k$, and
$G_k = \nabla^2 f(x_k)$ is the Hessian matrix. $Q(p)$ is a quadratic
function in p, and it can be minimized by setting its
gradient with respect to p equal to zero, resulting in a set
of linear equations for p, called the Newton equations:

$G_k p = -g_k$.    (2)

If $G_k$ is positive definite, then the solution of (2) corresponds
to the minimum of $Q(p)$, and p is used as a search
direction. Note that for this choice of p

$f(x_k + \alpha p) = f(x_k) - \alpha g_k^T G_k^{-1} g_k + O(\alpha^2)$,

so that for small values of $\alpha$ we have $f(x_k + \alpha p) < f(x_k)$,
whenever $g_k \ne 0$. Hence p is a local downhill direction
unless the first-order optimality conditions are satisfied.
If $G_k$ is not positive definite, then a "nearby" positive-definite
approximation to $G_k$ should be used in place of
$G_k$ in (2) [5].
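As a small worked example of the Newton equations (2) inside a descent method, the sketch below minimizes a two-variable convex test function. The test function, the closed-form 2x2 solve, and the step-halving line search are assumptions made for illustration; they are not the paper's test problems or its line search from [4].

```python
import math


def f(x):
    """Convex test function with minimizer at the origin (illustrative only)."""
    return x[0] ** 4 + x[0] ** 2 + x[0] * x[1] + x[1] ** 2


def gradient(x):
    return [4.0 * x[0] ** 3 + 2.0 * x[0] + x[1], x[0] + 2.0 * x[1]]


def hessian(x):
    return [[12.0 * x[0] ** 2 + 2.0, 1.0], [1.0, 2.0]]


def newton_direction(x):
    """Solve G p = -g for a 2x2 system by Cramer's rule."""
    g = gradient(x)
    (a, b), (c, d) = hessian(x)
    det = a * d - b * c
    return [(-g[0] * d + b * g[1]) / det, (-a * g[1] + c * g[0]) / det]


def minimize(x, tol=1e-10, max_iter=50):
    """Newton's method with step halving until the function value decreases."""
    for _ in range(max_iter):
        g = gradient(x)
        if math.hypot(g[0], g[1]) < tol:
            break
        p = newton_direction(x)
        alpha, fx = 1.0, f(x)
        while alpha > 1e-12 and f([x[0] + alpha * p[0], x[1] + alpha * p[1]]) >= fx:
            alpha *= 0.5
        x = [x[0] + alpha * p[0], x[1] + alpha * p[1]]
    return x


start = [2.0, -1.0]
first_direction = newton_direction(start)
solution = minimize(start)
```

The Hessian of this function is positive definite everywhere, so the computed direction always satisfies $p^T g_k < 0$ away from the minimizer and the step-halving loop terminates.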

The resulting optimization method has an asymptotic quadratic rate of convergence, and this rapid convergence rate is enticing, but solving (2) can be expensive for large-scale problems, since it involves computing the matrix of second derivatives and solving a large system of linear equations at every iteration. As a result, we have chosen to use a different technique to compute a search direction.

Truncated-Newton methods are more suitable than Newton's method for the solution of large-scale optimization problems. The search direction $p$ is computed as an approximate solution of (2), obtained using an iterative method for linear equations. Hence, a truncated-Newton method is a nested iterative method: there is an "outer" iteration for minimizing the function $f(x)$, and an "inner" iteration for solving the Newton equations (2). Here, in order to introduce parallelism into the algorithm, we have chosen to use the block Lanczos method. A more common choice is the linear conjugate-gradient algorithm [8].

Truncated-Newton methods are attractive since they can be programmed to have low storage and arithmetic costs, do not require the computation of the Hessian matrix, converge rapidly, and are applicable to large problems.

Earlier examples of truncated-Newton methods ([2], [9]) have been useful on vector computers [17], but have not offered much scope for exploiting parallel computers. By using a block Lanczos method for the inner iteration, parallel computations are introduced, where the degree of parallelism corresponds to the block size chosen, and hence can be adapted to the number of processors available. The block algorithms also retain a great many vector operations, and thus can be effective on parallel computers where each processor has vector hardware, such as the Alliant and Intel iPSC/2 machines.

Such block methods for solving linear equations have been described in [11] and [13]. The method used here is based on the block Lanczos method [16]; this is not the most straightforward choice, but it permits the numerically stable treatment of non-convex optimization problems (cf. [9]). (If the Hessian is not positive definite, the solution of the Newton equations may not be a descent direction; the Lanczos method allows the detection and correction of this difficulty. In addition, this approach is numerically stable for a non-positive-definite system of linear equations, unlike its theoretically equivalent partner, the linear conjugate gradient method.)

We now provide the formulas for the block Lanczos method. The algorithm minimizes $Q(p)$ as a function of $p$ over a sequence of subspaces of increasing dimension. A more detailed discussion of the block Lanczos method can be found in the references cited above. The specific formulation given here is taken from [10].

Let $G$ be an $n \times n$ symmetric matrix. The block Lanczos method with block-size $m$ generates a sequence of $n \times m$ orthogonal matrices $\{V_i\}$ via:

Pick $V_1$ so that $V_1^T V_1 = I_m$. Set $V_0 = 0_{n \times m}$, $\beta_1 \in \Re^{m \times m}$.

For $i = 1, 2, \ldots$

Set

$$ V_{i+1}\beta_{i+1} = G V_i - V_i \alpha_i - V_{i-1}\beta_i^T, \qquad (3) $$

where $\alpha_i = V_i^T G V_i$ and the $m \times m$ matrix $\beta_{i+1}$ is chosen so that $V_{i+1}^T V_{i+1} = I_m$.

$V_{i+1}$ is computed as the result of a QR factorization applied to the columns of the right-hand side in (3). The matrix $V_1$ can be obtained using a random-number generator. We will assume that $m$ divides $n$, although this is not necessary, and that the algorithm proceeds as above for the full $n/m$ iterations (see below for a further discussion).
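One step of recurrence (3) can be sketched in plain Python as follows. This is only an illustrative serial version under stated assumptions: matrices are lists of rows, the helper names (`matmul`, `transpose`, `mgs_qr`, `block_lanczos_step`) are hypothetical, the QR factorization is done by modified Gram-Schmidt, and the right-hand side of (3) is assumed to have full column rank.

```python
def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def mgs_qr(A):
    """Modified Gram-Schmidt: A = Q R with Q^T Q = I and R upper triangular.

    Assumes the columns of A are linearly independent.
    """
    n, m = len(A), len(A[0])
    cols = [[A[r][c] for r in range(n)] for c in range(m)]
    R = [[0.0] * m for _ in range(m)]
    for j in range(m):
        R[j][j] = sum(v * v for v in cols[j]) ** 0.5
        cols[j] = [v / R[j][j] for v in cols[j]]
        for k in range(j + 1, m):
            R[j][k] = sum(a * b for a, b in zip(cols[j], cols[k]))
            cols[k] = [b - R[j][k] * a for a, b in zip(cols[j], cols[k])]
    Q = [[cols[c][r] for c in range(m)] for r in range(n)]
    return Q, R


def block_lanczos_step(G, V_prev, V_cur, beta_cur):
    """Return alpha_i, V_{i+1}, beta_{i+1} from V_{i-1}, V_i, beta_i."""
    GV = matmul(G, V_cur)
    alpha = matmul(transpose(V_cur), GV)
    Va = matmul(V_cur, alpha)
    Vb = matmul(V_prev, transpose(beta_cur))
    n, m = len(GV), len(GV[0])
    W = [[GV[r][c] - Va[r][c] - Vb[r][c] for c in range(m)]
         for r in range(n)]
    V_next, beta_next = mgs_qr(W)
    return alpha, V_next, beta_next


# Hypothetical 4 x 4 symmetric matrix with block size m = 2.
G = [[4.0, 1.0, 2.0, 0.0],
     [1.0, 3.0, 0.0, 1.0],
     [2.0, 0.0, 5.0, 1.0],
     [0.0, 1.0, 1.0, 2.0]]
V0 = [[0.0, 0.0] for _ in range(4)]
V1 = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [0.0, 0.0]]
beta1 = [[0.0, 0.0], [0.0, 0.0]]

alpha1, V2, beta2 = block_lanczos_step(G, V0, V1, beta1)
```

In this sketch $\beta_{i+1}$ is the upper triangular factor of the QR factorization, so the new block $V_{i+1}$ has orthonormal columns by construction.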

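Define the block matrix $V_{(i)} = (V_1 | V_2 | \cdots | V_i)$; if exact arithmetic were used in the above algorithm, then we would have $V_{(i)}^T V_{(i)} = I$, and $V_{(i)}^T G V_{(i)} = T_{(i)}$, where $T_{(i)}$ is a block tridiagonal matrix with $m \times m$ blocks:

$$
T_{(i)} =
\begin{pmatrix}
\alpha_1 & \beta_2^T & & 0 \\
\beta_2 & \alpha_2 & \ddots & \\
 & \ddots & \ddots & \beta_i^T \\
0 & & \beta_i & \alpha_i
\end{pmatrix}.
$$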

A method for solving (2) is obtained as follows: let the first column of $V_1$ be $g/\|g\|_2$, where $g$ is the right-hand side in (2). Solve

$$ T_{(i)} y_i = e_1 = (1, 0, \ldots, 0)^T $$

for $y_i$. Then $p_i$, the $i$-th approximation to the solution of (2), is obtained from $p_i = V_{(i)} y_i$. This is equivalent to the block conjugate gradient method in [11]; both algorithms produce the same estimates of the solution of (2), if exact arithmetic is used.

This derivation is not suitable for computation since the resulting algorithm is not iterative. However, by adapting the derivation in [15], an iterative method can be developed. Assume now that $G$ is positive definite; we will treat the indefinite case below. We use Gaussian elimination to factor the block tridiagonal matrix:

$$ T_{(i)} = L_{(i)} D_{(i)} L_{(i)}^T, \qquad (4) $$

where $D_{(i)}$ is a block diagonal matrix whose blocks are themselves diagonal, and $L_{(i)}$ is a block lower bidiagonal matrix, with blocks the same size as in $T_{(i)}$. Define

$$ U_{(i)} = V_{(i)} L_{(i)}^{-T}, \qquad (5) $$

$$ s_{(i)} = -D_{(i)}^{-1} L_{(i)}^{-1} V_{(i)}^T g; \qquad (6) $$

both $U_{(i)}$ and $s_{(i)}$ can be generated iteratively. Then

$$ p_i = -V_{(i)} T_{(i)}^{-1} V_{(i)}^T g = -(V_{(i)} L_{(i)}^{-T})(D_{(i)}^{-1} L_{(i)}^{-1} V_{(i)}^T g) = U_{(i)} s_{(i)}, \qquad (7) $$

and so an iterative algorithm, referred to here as the block Lanczos/CG method, is obtained. The formulas for the algorithm and their associated costs are described in more detail in the next section.

Minor adjustments to the algorithm are necessary if (a) the algorithm converges early, (b) $m$ does not divide $n$, or (c) there is loss of orthogonality due to rounding errors. In such circumstances, when $V_{i+1}$ is computed only the first $m_1 < m$ columns may be linearly independent. If this happens, then $\beta_{i+1}^T$ will be an $m \times m_1$ matrix and $V_{i+1}$ will be an $n \times m_1$ matrix. The remaining matrices in the algorithm will also have to be adjusted, but the formulas given above are still valid.

If $f(x)$ is not convex, then $G$ may not be positive definite at every outer iteration. If this happens, then at some iteration $i$ the $LDL^T$ factorization of $T_{(i)}$ will not be numerically stable. Another factorization could be substituted (see [15] and [13]), but since we are more interested in obtaining a descent direction than in solving (2), alternative techniques may make more sense. It would be possible to use a modified matrix factorization, as described in [4] and [9], or the algorithm could be stopped at the iteration where indefiniteness appears. Either approach will produce a descent direction.
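The idea of stopping where indefiniteness appears can be illustrated with a scalar $LDL^T$ factorization that reports the first nonpositive pivot. This is a simplified sketch, not the block factorization (4) itself: it works on an ordinary dense symmetric matrix, and the function name `ldl_factor` and the two test matrices are assumptions made for the example.

```python
def ldl_factor(T):
    """Factor symmetric T as L diag(d) L^T with L unit lower triangular.

    Returns (L, d, stop). stop is None when every pivot is positive,
    otherwise it is the index of the first nonpositive pivot, at which
    point the factorization is abandoned.
    """
    n = len(T)
    L = [[1.0 if r == c else 0.0 for c in range(n)] for r in range(n)]
    d = []
    for j in range(n):
        pivot = T[j][j] - sum(L[j][k] ** 2 * d[k] for k in range(j))
        if pivot <= 0.0:
            return L, d, j  # indefiniteness (or singularity) detected
        d.append(pivot)
        for r in range(j + 1, n):
            partial = sum(L[r][k] * L[j][k] * d[k] for k in range(j))
            L[r][j] = (T[r][j] - partial) / pivot
    return L, d, None


# Positive definite example: all pivots are positive.
_, d_pd, stop_pd = ldl_factor([[4.0, 2.0], [2.0, 3.0]])

# Indefinite example: the second pivot is negative.
_, d_ind, stop_ind = ldl_factor([[1.0, 2.0], [2.0, 1.0]])
```

A positive pivot sequence certifies that the leading part of the matrix is positive definite, which is why the sign of the pivots can serve as the stopping test.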

3.  Parallel  and  Vector  Operations. 

The block Lanczos method permits us to exploit parallel and vector capabilities in nearly every aspect of the computation of a search direction. In this section, we describe in detail one way of implementing the algorithm, the one used in [10], showing the arithmetic and communication costs associated with each step of the algorithm. To simplify the discussion, we assume that the block size $m$ is equal to the number of processors. This is not essential to the algorithm.

We shall consider each of the steps of the block Lanczos/CG algorithm in turn. We have implemented the algorithm on an Intel iPSC/2 which has no global memory; each processor has its own local memory. We tacitly assume that $n$ will be much larger than $m$, although the algorithm is valid without this assumption. Because of this, each processor stores only a small number of vectors of length $n$ (one column of each of the $n \times m$ matrices $V_i$, $V_{i-1}$, $GV_i$, $U_i$, $W_i$, plus one work vector), but stores complete copies of the $m \times m$ matrices $\alpha_i$, $\beta_i$, $L_i$, and $D_i$. If the number of processors were large, and hence $m$ was large, then other approaches would be recommended; see the comments at the end of this section.

In the following discussion, we will number the processors from 1 to $m$, rather than the more usual 0 to $m - 1$.

1. The Lanczos iteration: For nonlinear optimization, and particularly when the objective function $f(x)$ is expensive to evaluate, this will typically be the most expensive step in the method, and the place where effective use of parallelism will be most essential. This step involves $m$ independent matrix-vector products, one for each column of $V_i$. If the Hessian $G$ is available, $GV_i$ can be computed using traditional techniques. However, more often a matrix-vector product will be approximated using [12]

$$ Gv \approx \frac{g(x + hv) - g(x)}{h}, $$

where $g(x)$ is the gradient and $h$ is a finite-difference parameter. Since $g(x)$ is the right-hand side of (2), it is already available, and so a matrix-vector product can be approximated using a single gradient evaluation, and $GV_i$ can be approximated by $m$ independent gradient evaluations, one per processor. Thus we can make effective use of parallel gradient evaluations. If the gradient were not available, it could be approximated using a further level of finite differencing, without losing the parallelism discussed here.

• Communication: This step requires that the gradient be sent to each processor ($n$ real numbers).

• Arithmetic (per processor): One gradient evaluation, two vector additions ($2n$ operations), and two vector scalings ($2n$ operations).

2. Forming $\alpha_i$ and $V_i\alpha_i$: These matrices are computed simultaneously by sending the columns of the matrix $V_i$ cyclically around the hypercube, considered as a ring. At the $j$-th step of this procedure, processor $l$ computes $(\alpha_i)_{j'l}$ where $j' = [(l + j - 1) \bmod m] + 1$. Processor $l$ then computes $(V_i)_{j'}(\alpha_i)_{j'l}$. Note that $(V_i\alpha_i)_l = \sum_{j'=1}^{m} (V_i)_{j'}(\alpha_i)_{j'l}$.

• Communication: Each processor sends/receives $m$ vectors of size $n$.

• Arithmetic (per processor): Computation of the two matrices requires $2mn$ multiplications and $2(m - 1)n$ additions.

3. Forming $V_{i-1}\beta_i^T$: This matrix is formed in the same way as above.

• Communication: Each processor sends/receives $m$ vectors of size $n$.

• Arithmetic (per processor): Forming the matrix requires $jn$ multiplications and $(j - 1)n$ additions on processor $j$. Since $\beta_i$ is a triangular matrix, the arithmetic costs are slightly lower than before.

4. Forming the right-hand side in (3): This involves $m$ independent vector additions.

• Communication: None, after the previous steps have been completed.

• Arithmetic (per processor): 2 vector additions ($2n$ operations).

5. Determining $V_{i+1}$ and $\beta_{i+1}$: This consists of a QR factorization of the right-hand side in (3), and can also be done in parallel [14]. A modified Gram-Schmidt algorithm is used [6].

• Communication: Processor $j$ sends one $n$-vector to $m - j$ processors, and receives $(j - 1)$ $n$-vectors.

• Arithmetic (per processor): Forming the factorization requires $2nj$ multiplications, $(2jn - j - n)$ additions, and one square root on processor $j$.

6. Factorization (4) of $T_{(i)}$: The matrix $L_{(i)}$ is block lower bi-diagonal with diagonal blocks $L_i$, and with subdiagonal blocks $L_{i,i-1}$. Let $D_i$ (diagonal) be the $i$-th diagonal block of $D_{(i)}$. Then the factors of $T_{(i)}$ can be determined via

$$
\begin{aligned}
\alpha_1 &= L_1 D_1 L_1^T, \\
\beta_i &= L_{i,i-1} D_{i-1} L_{i-1}^T, \\
\alpha_i &= L_i D_i L_i^T + L_{i,i-1} D_{i-1} L_{i,i-1}^T, \quad i > 1.
\end{aligned}
$$

These formulas correspond to $LDL^T$ factorizations or back substitutions. These operations only involve $m \times m$ matrices. We have computed them on a single processor since $m$ is small (at most 16) in our case. They can be performed simultaneously on all processors, as suggested in [13].

• Communication: None.

• Arithmetic (per processor): Ignoring lower-order terms, formation of the new factors costs $m^3/2$ multiplications and additions.

7. Forming $U_{(i)}$ in (5): Write $U_{(i)} = (U_1 | U_2 | \cdots | U_i)$ as was done with $V_{(i)}$. Then $U_i$ can be computed by solving (via back substitution)

$$ U_i L_i^T = V_i - U_{i-1} L_{i,i-1}^T, \qquad U_0 = 0. $$

The second term on the right-hand side is formed in the same way as in step 3 above; combining the two terms involves $m$ independent vector additions. To finish computing $U_i$ requires repeated back substitution to solve for the rows of $U_i$.

• Communication: In forming the right-hand side, each processor sends/receives $m$ vectors of size $n$. To solve for $U_i$, processor $j$ sends $n$ $(m - j)$-vectors and receives $n$ $(m - j - 1)$-vectors (except processor 1).

• Arithmetic (per processor): To form the right-hand side requires $jn$ multiplications and additions on processor $j$. Solving for $U_i$ costs $n(m - j)$ multiplications and additions.

8. Forming $s_{(i)}$ in (6): Divide $s_{(i)}$ up conformally to $U_{(i)}$: $s_{(i)}^T = (s_1^T | \cdots | s_i^T)$. Then

$$ s_1 = -D_1^{-1} L_1^{-1} V_1^T g; $$

note that $V_1^T g$ has only one non-zero component. At later iterations, we compute $s_i$ from

$$ L_i D_i s_i = -L_{i,i-1} D_{i-1} s_{i-1}. $$

These operations only involve vectors and matrices of order $m$, and we have chosen to do them simultaneously on all processors.

• Communication: None.

• Arithmetic (per processor): This step requires $m^2 + 2m$ multiplications and $m(m - 1)$ additions.

9. Forming $p_i$: Divide up $U_{(i)}$ and $s_{(i)}$ as above. Then (7) can be written in the form

$$ p_i = p_{i-1} + U_i s_i, $$

and the right-hand side can be formed as the linear combination of $m$ vectors, with in our case one on each processor. The Intel hypercube has a built-in operation of this type.

• Communication: Except for processor 1, each of the processors sends one $n$-vector. Processor 1 receives $m - 1$ $n$-vectors.

• Arithmetic (per processor): There are $n$ multiplications per processor. Processor 1 performs $mn$ additions.

10. Compute the residual $Gp_i + g$ (for the convergence test): The formulas for the block Lanczos algorithm give

$$ Gp_i = G U_{(i)} s_{(i)} = \left( V_{(i)} T_{(i)} + (0 | \cdots | 0 | W_i) \right) L_{(i)}^{-T} s_{(i)}, $$

where $W_i = GV_i - V_i\alpha_i - V_{i-1}\beta_i^T$. Since

$$ V_{(i)} T_{(i)} L_{(i)}^{-T} s_{(i)} = -g $$

and

$$ (0 | \cdots | 0 | W_i)\, L_{(i)}^{-T} s_{(i)} = W_i L_i^{-T} s_i, $$

we obtain $Gp_i + g = W_i (L_i^{-T} s_i)$. The term in parentheses is computed simultaneously on all processors, and the result is the linear combination of $n$-vectors, one per processor. The more obvious formula for calculating the residual was not used, to avoid an additional matrix-vector product. The resulting computations are almost the same as in the previous step.

• Communication: As in step 9.

• Arithmetic (per processor): Each processor performs $n + m^2/2$ multiplications and $m^2/2$ additions. Processor 1 in addition performs $n(m - 1)$ additions.
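The finite-difference matrix-vector product of step 1 can be sketched as follows. This is a serial illustration under stated assumptions: the example gradient, the value of `h`, and the function name `hessvec_fd` are hypothetical, and the loop over columns stands in for the $m$ gradient evaluations that would run one per processor.

```python
def hessvec_fd(gradient, x, v, g0, h=1e-6):
    """Approximate G v by (g(x + h v) - g(x)) / h.

    g0 is the gradient at x, which is already available as the
    right-hand side of the Newton equations, so only one new
    gradient evaluation is needed per product.
    """
    shifted = [xi + h * vi for xi, vi in zip(x, v)]
    gh = gradient(shifted)
    return [(a - b) / h for a, b in zip(gh, g0)]


# Hypothetical example: f(x) = x0^2 + 3*x1^2 + x0*x1,
# whose exact Hessian is [[2, 1], [1, 6]].
def gradient(x):
    return [2.0 * x[0] + x[1], x[0] + 6.0 * x[1]]


x = [1.0, 1.0]
g0 = gradient(x)

# Columns of the block V_i; each product is independent of the others.
columns = [[1.0, 0.0], [0.0, 1.0]]
GV_columns = [hessvec_fd(gradient, x, v, g0) for v in columns]
```

Each entry of `GV_columns` depends only on its own column of the block, which is the property that allows the products to be computed in parallel.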

The above discussion shows that all the major steps (that is, all the $O(n)$ steps) in the block Lanczos/CG algorithm can exploit parallelism. In addition, many of these steps correspond to basic linear algebra subroutines (BLAS); for example, the inner products, linear combinations of vectors, and multiplications of vectors by scalars. These operations can be carried out using vector hardware or assembly-language instructions on many computers, in particular the Intel hypercubes and the Alliant. As a result, this algorithm should be well suited to parallel and parallel/vector computers.

The description above represents a column-wise organization of the algorithm. This is appropriate in this application because the matrix-vector products are produced one column per processor. Row-wise organizations are described in [13], where each processor stores a group of rows from each $n \times m$ matrix.

4.  Conclusions 

We have presented a truncated-Newton method for minimization of a nonlinear function suitable for a parallel computer. It is based on a block Lanczos inner algorithm that can exploit parallel gradient evaluations. We believe that a successful parallel optimization algorithm for general use must be able to use parallel function/gradient evaluations, as this algorithm does. It should be especially useful when function/gradient evaluations are costly, and when the number of variables is larger than the number of processors available.

The algorithm is made up of steps that provide many opportunities for exploiting parallelism. The costs of these steps, both arithmetic and communication, have been described in detail. In addition, the lower level operations offer the possibility of further improvements in performance when the processors on the parallel computer also have vector capabilities.
A TOOL TO GENERATE FORTRAN PARALLEL CODE FOR THE INTEL IPSC/2 HYPERCUBE

C. Gonzalez, J. Chen, and J. Sarma, George Mason University


ABSTRACT 

This paper reports on a software tool (precompiler) for translating sequential Fortran code to parallel form. We investigated and implemented a methodology for detecting data dependencies. A code generator was designed and implemented for the Intel iPSC/2 hypercube. This research concentrated on parallelizing do-loop structures, by dividing the data among the nodes. An outline and examples of the code generated for the cube manager and the nodes are presented. We found that this precompiler could be an essential tool for using the hypercube effectively and efficiently.

Key  Words:  Pre-compiler,  software  tool,  Fortran, 
hypercube,  data  dependence,  code  generation, 
supercomputer,  parallelizing  software. 

1.  INTRODUCTION 

Modern supercomputers (i.e. parallel computers) provide hardware capabilities for parallel processing, but lack the software tools to support this parallelism. These supercomputer systems consist of a variety of multiprocessors, vector processors or multicomputers interconnected in some fashion. Parallel computers are most effectively used when executing parallel object code. Unfortunately, most compilers for such systems can only process sequential source code. The parallelism is obtained by the use of explicit instructions inserted in the code. This restriction requires the user to explore and detect the parallelism inside the problem and insert the commands for concurrent programming [Seit-85]. This paper reports on the design and implementation of techniques for translating sequential code to parallel code. Our long-term objective is the construction of a tool that could convert valuable "old" sequential code to run on supercomputers.

Allen and Kennedy [Alle-82] preprocessed FORTRAN source code into FORTRAN 8x code in three steps: program normalization, dependence testing, and parallel code generation. This is the same general approach used in this research. Another related work is the family of vectorizers (KAPs) designed by Kuck and Associates, Inc., which use loop interchanging [Davi-86, Huso-86, Mack-86]. An important difference with our work is that the KAPs' underlying machines (ST-100, S-1, and Cyber 205) have a tightly coupled architecture, whereas our work was done for an Intel iPSC/2 hypercube, which is a loosely coupled architecture. Padua [Padu-86] gave a comprehensive discussion of two types of parallel code for compiler optimization: vector and concurrent. We combined the above techniques, adding some source code optimization in front of the compiler. Our precompiler assumes "error-free" FORTRAN programs as input, and proceeds to parallelize the data for the do loops (SIMD model), but not the code.

This research was supported by the Center for Innovative Technology, contract No. SPC'87-005, and by the Army Research Office, contract DAAlOS-87 r 0087.

2.  SYSTEM  MODEL 

The functional decomposition of the model used for translating sequential code to executable code has five modules: the lexical analyzer, the data dependence detector, the parallel code generator, the vectorizer, and the compiler. The produced final object code is composed of two different sets of code: one to be executed in the cube manager, and the other to be executed in each node of the hypercube.


Figure 1. System Model


In this research we designed and implemented the first three modules of the model described above (i.e. the lexical analyzer, data dependence detector, and code generator). We used the Fortran compiler from Green Hills Software, Inc., for generating executable object code. We did not use the available vectorizer software and hardware in this project.

2.1  The  Lexical  Analyzer 

The lexical analyzer translates Fortran source code into a sequence of tokens, and fills in a symbol table and an array description table. A BNF of the simplified grammar subset of Fortran that we used is presented in Fig. 2.

<prog> ::= PROGRAM <id> { <arredec> }n <statement> STOP END
<arredec> ::= DIMENSION <id> <index> { , <id> <index> }
<index> ::= ( <integer> { , <integer> }n ) (where n=2)
<id> ::= <letter> { <letter> | <digit> }
<letter> ::= A | ... | Z
<integer> ::= <digit> { <digit> }n
<digit> ::= 0 | ... | 9
<statement> ::= { <dostatement> | <simplestatement> }n
<dostatement> ::= DO <label> <id> = <doindex> , <doindex> { , <doindex> } { <statement> }n <label> CONTINUE
<doindex> ::= <id> | <integer>
<label> ::= <digit> { <digit> }n (where n=4)
<simplestatement> ::= A string of characters not having a DO as first characters.

Figure 2. Subset of FORTRAN Grammar

2.2  Data  Dependence  Detector  (DDD) 

The DDD performs semantic analysis of the code to check for parallel do-loops. The input for this module consists of the tokens, symbol table, and array description table generated by the Lexical Analyzer. The semantic information is stored in a dependence table. This information includes the line numbers where the read-, write-, and do-statements were found, as well as information about the variables and arrays on the left and right hand sides of the corresponding statements. The DDD also outputs the array indexes and loop control variables, which are especially important for multi-dimension and multi-level do-statements.


2.3  Code  Generator 

The code generator makes use of information generated by the Lexical Analyzer, and the data dependence tables generated by the DDD (see Figure 3). If the current line generated is a non-parallelizable statement (i.e. one with no data dependence implications), the code generator simply gets the information directly from the lexical analyzer output (step 1). If the current statement analyzed is a read-, write-, or do-statement, the code generator uses information from both the lexical analyzer and the dependence detector (steps 1 and 2). The next step synthesizes the information and writes it to a buffer. A final step produces the source for the host (file name: host.f) and the source for the node (file name: node.f).


Figure 3. Code Generator and Tables


2.4 Vectorizer

The purpose of the vectorizer software is to generate code that will use the vectorizer hardware board. The code produced by the code generator could be the input for this software. Hence, it also does data dependence checking and modifies the code by adding vector calls. The vector calls are supported by a vector library and a vector processor attached to each node. The available software vectorizer is VAST-2, which, according to user directives, changes the program to expose array operations (Fig. 4).


      DO 20 I=1,N
         S=0.0
         DO 10 J=1,N
            S=S+A(I,J)*X(J)
   10    CONTINUE
         Y(I)=S
   20 CONTINUE

(panels: source code; vector call output)

Figure 4. VAST-2 Program Development Sequence


2.5  FORTRAN  Compiler 

The iPSC/2 FORTRAN compiler used for this project was from Green Hills Software, Inc. The files host.f and node.f were compiled and linked to produce the files host and node, which are executable code.

3.  DETECTION  OF  DATA  DEPENDENCIES 

The data dependencies among the statements were the deciding factor in whether the "do loop" could be processed in parallel or not.

Vector call output (Figure 4):

      DO 20 I=1,N
         Y(I)=DDOT(N,A(I,1),LNA,X(1),1)
   20 CONTINUE

3.1 Basic Assumptions

We made certain assumptions in order to implement the data dependence analyzer. These assumptions were necessary so that we could handle simple loops before adding more complexity. The assumptions made were as follows:

1) There was only one level of do loops, i.e. no nesting of do loops was considered.

2) There were no equivalence statements in the source program.

3) The do loop was a very simple one (i.e. only arithmetic operations were performed inside the loop). There were no logical statements inside the loop (i.e. no transfer of flow statements), for which more complex analysis would be required.

4) Arrays had no more than two indices. The DDD can easily be extended to handle more than two indices.

3.2  Types  of  Dependencies 

Data dependence relations between two statements determine if they can be executed in parallel. There are different types of dependencies between statements [Padu-86].

a) Flow dependence: can exist between two statements S1 and S2, if a value computed in S1 is used in S2. Since statement S2 needs the value from S1, it cannot be executed until statement S1 has finished executing. The following statements are an example of this type of dependence.

      S1: A(I) = B(I) + C(I)
      S2: D(I) = A(I) * 3

b) Antidependence: exists between two statements S1 and S2, if S1 uses a variable which is assigned a new value in statement S2. The following statements are an example of this type of dependence.

      S1: A(I) = B(I) + C(I)
      S2: B(I) = D(I) * 3

As can be seen from this example, the two statements S1 and S2 cannot be executed in parallel, as S1 uses the old value of B(I), which is later assigned a new value in S2.

c) Output dependence: can exist between two statements if a variable is assigned a value in one statement and is later assigned a new value in another statement. The following statements are an example of this type of dependence.

      S1: A(I) = B(I) + C(I)
      S2: D(I) = A(I) * 3
      S3: A(I) = E(I) + F(I)

A(I) will contain a wrong value if statement S1 is executed after statement S3. These statements have to be executed in the sequence in which they appear so that all the left hand side variables contain the correct value.

d) Control dependence: is the dependence which occurs from an "if" statement to the statements which are within the "if" statement block.

In the implementation of the precompiler, we considered only the first three types of dependencies. Control dependency was not analyzed because of the assumption that there were no logical statements inside the do loop.
3.3  Direction  of  the  Dependencies 

The direction of the data dependence relations also has to be analyzed inside the do loop. The data dependencies inside the do loop are found by analyzing the arrays and their subscripts. The following are the types of data dependence direction:

a) Equal flow dependence:

      DO 100 I = 1, K
      S1: A(I) = B(I) + C(I)
      S2: D(I) = A(I) * 3
  100 CONTINUE

There is flow dependence between statements S1 and S2, but this dependence relation stays within the same iteration of the do loop. That is, for any iteration, the value assigned to A(I) in statement S1 is used by statement S2 in the same iteration. Therefore we can say that there exists equal flow dependence between S1 and S2.

b) Less than flow dependence:

      DO 100 I = 2, K
      S1: A(I) = B(I) + C(I)
      S2: D(I) = A(I-1) * 3
  100 CONTINUE

Statement S2 uses a value of the array variable A which was assigned during the previous iteration of the do loop, i.e. it uses an old value of the array variable A. The flow dependence does not stay within the same iteration; instead it flows from iteration i-1 to iteration i.

c) Less than antidependence:

      DO 100 I = 1, K-1
      S1: A(I) = B(I) + C(I)
      S2: D(I) = A(I+1) * 3
  100 CONTINUE

Here statement S2 uses an old value of the array variable A which is assigned a new value in the next iteration by statement S1. Since S2 uses an old value, there exists an antidependence relation between the two statements S1 and S2. The dependency flow is from iteration i to iteration i+1.

The DO loop is parallelizable only if the statements inside the do loop block have an equal flow dependence relation.
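For subscripts of the simple form I + constant, the three directions above can be told apart by comparing the subscript offsets. The following sketch assumes that S1 writes A(I + w) and a later statement S2 reads A(I + r) in the same loop body; the function names and this restriction to constant offsets are assumptions made for the example, not a description of the DDD's internals.

```python
def direction(write_offset, read_offset):
    """Classify the dependence from S1: A(I+w) = ... to S2: ... = A(I+r)."""
    distance = write_offset - read_offset
    if distance == 0:
        return "equal flow"        # value is produced and used in one iteration
    if distance > 0:
        return "less-than flow"    # value is used in a later iteration
    return "less-than anti"        # old value is read before a later write


def parallelizable(offset_pairs):
    """Apply the rule in the text: only equal flow dependence is allowed."""
    return all(direction(w, r) == "equal flow" for w, r in offset_pairs)


case_a = direction(0, 0)    # S1: A(I) = ...   S2: D(I) = A(I) * 3
case_b = direction(0, -1)   # S1: A(I) = ...   S2: D(I) = A(I-1) * 3
case_c = direction(0, 1)    # S1: A(I) = ...   S2: D(I) = A(I+1) * 3
```

Under this rule only loop (a) would be handed to the parallel code generator; loops (b) and (c) carry a dependence across iterations and stay sequential.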

3.4  Semantic  Information 

The data dependency analyzer generates the information needed by the code generator. The information required was the line number of the "do loop". Special treatment was required if there were any "read", "write", or "print" statements in the source program. These statements had to be processed by the cube manager, because the nodes cannot access files. The dependency table was implemented by using arrays. The four statements 'read', 'write', 'print', and 'do loop' are assigned integer values (0 to 3), and this value is stored in the data dependency table, along with the line number of the statement in the source program. In addition, information about the variables on the right hand side and left hand side of the assignment statement is also stored in arrays.

4. CODE GENERATION

The code generator produces parallel do loops
for the nodes. The following sections describe
some key issues, such as the format used by our
model, communication overhead, buffer size, and
work load for each node.

4.1  Communication  Between  Host  and  Nodes 

A typical iPSC/2 hardware configuration is
shown in Fig. 5. The SRM functions are: support
for program development, cube management, I/O
interface, and gateway to host machines. The SRM
hardware consists of a processor mother board (16
MHz 80386, 80387 coprocessor, and console
terminal port), 8 Mbytes of 32-bit RAM, a DCM
board, and an Ethernet TCP/IP communications
board. Each node has a 16 MHz 80386 processor
with an 80387 coprocessor, 1-8 Mbytes of RAM,
and a DCM board. The nodes communicate through
message passing. The topology of the network is a
hypercube. The cube we worked on has 16 nodes
numbered 0 to 15.


Figure 5. iPSC/2 configuration.


Communication  Routines: 

Our  model  uses  the  routines  csend  and  crecv  to 
communicate  between  the  host  and  the  nodes. 

a) csend(MSGTYPE, BUF, MSGLEN, NODEID, NODEPID)
Sends a message between the nodes and the
host, and waits until the whole message has gone
out.

MSGTYPE is the type of message, used as
message identifier.

BUF is a one-dimensional array of integers or
reals, containing the message to be sent.

MSGLEN is the number of bytes in BUF (from 1
to MSGLEN) that will be sent out.

NODEID is the destination node/host id.

NODEPID is the process id at the destination
node/host.

It is important to point out that the network
path for communication is handled at the
operating system level and is thus hidden at the
FORTRAN level. Because of the hypercube topology
we know the paths followed by each message, and
we could use this information to minimize
communication delays.

b) crecv(MSGTYPE, BUF, MSGLEN)
Receives a message from other nodes or
from the host, and waits until the whole
message is received.

MSGTYPE is the type of message, used as
message identifier. A message is accepted only
if its MSGTYPE matches that of the csend.

BUF is a one-dimensional array of integers or
reals, containing the message received.

MSGLEN is the upper bound on the number of bytes
in BUF (from 1 to MSGLEN) that will be received.

Following  is  an  example  for  sending  the  first 
100  real  elements  of  an  array  from  the  host  to 
all  the  nodes. 

host:  REAL*4 BUFFOUT(2000), A(1000)
       INTEGER*4 TYPEOUT, ALLNODES, NPID, LENOUT
       DATA ALLNODES /-1/, NPID /1/, TYPEOUT /1/
       LENOUT = 400
       DO 301 I = 1, 100
          BUFFOUT(I) = A(I)
301    CONTINUE
       CALL CSEND(TYPEOUT, BUFFOUT, LENOUT,
                  ALLNODES, NPID)
       END

node:  REAL*4 BUFFIN(2000), A(1000)
       INTEGER*4 TYPEIN, HOST, LENIN
       DATA TYPEIN /1/, NPID /1/
       HOST = MYHOST()
       LENIN = 4000
       CALL CRECV(TYPEIN, BUFFIN, LENIN)
       DO 301 I = 1, 100
          A(I) = BUFFIN(I)
301    CONTINUE
       END

To reduce communication time, we made each
message as long as possible instead of passing
several short messages.

4.2  Disk  I/O 

In the iPSC/2 only the host can read from or
write to disk. For a read-statement in the source
code, we generate the code shown in Fig. 7. For
a write-statement in the source code, we
simply copy the statement to the host.

4.3  The  Host 

The host manages the computations of each
node, handles I/O, and communicates with other
hosts. Our host code has features for supporting
the above functions. For example, we set the
do-loop control variables that determine the
workload of each node at run time. The host sends
messages to all nodes concurrently, and waits to
receive the results from all the nodes (Fig. 8).
A set of read
statements in the source code will generate the


source:  READ(1, 110) N1
         READ(1, 111) (A(I), I = 1, N1)
         END

host:    READ(1, 110) N1
         READ(1, 111) (A(I), I = 1, N1)
         LENOUT = N1 * 4 + 1 * 4
         BUFFOUT(1) = N1
         DO 301 I = 1, N1
            BUFFOUT(I+1) = A(I)
301      CONTINUE
         CALL CSEND(TYPEOUT, BUFFOUT, LENOUT,
                    ALLNODES, NPID)
         END

node:    LENIN = 2000 * 4 + 1 * 4
         CALL CRECV(TYPEIN, BUFFIN, LENIN)
         N1 = BUFFIN(1)
         DO 301 I = 2, N1 + 1
            A(I-1) = BUFFIN(I)
301      CONTINUE
         END

Figure 7. Read-Statements

code  described  in  figure  7.  This  code  will  be 
inserted  whenever  the  read  statement  is 
recognized  by  the  code  generator. 
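
The buffer layout used for read-statements (a count followed by the data, all as 4-byte values) can be mimicked outside FORTRAN. The sketch below is an illustration in Python, not part of the generated code: the function names are hypothetical, and it assumes little-endian 4-byte reals with the count stored as a real, as BUFFOUT(1) = N1 does.

```python
import struct


def pack_read_buffer(values):
    """Return (buffer, LENOUT) for [N1, A(1), ..., A(N1)] stored as 4-byte reals."""
    n1 = len(values)
    buf = struct.pack(f"<{n1 + 1}f", float(n1), *values)
    lenout = n1 * 4 + 1 * 4          # same byte count as in the host code
    return buf, lenout


def unpack_read_buffer(buf):
    """Recover A(1..N1) on the receiving side from the packed buffer."""
    n1 = int(struct.unpack_from("<f", buf, 0)[0])   # first word holds the count
    return list(struct.unpack_from(f"<{n1}f", buf, 4))
```

The receiver only needs an upper bound on the message length in advance, because the element count travels in the first word of the message.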

4.4  The  Nodes 

All nodes concurrently run the same copy
of the node program, but they execute on
different data (SIMD model). We use message
passing between host and nodes, but not node to
node. The message passing routines, csend and
crecv, were used to synchronize the operations
between host and nodes. Each node receives all
the data. The data was not partitioned (i.e., all
nodes get all the data) in order to keep the
generated code simple. Part of the
continuation of this project will be the analysis
of sending to every node only the data it
needs to perform its computation.

Each node receives from the host the right
hand side values of statements inside the do
loop, and the do loop control values. It then
calculates its own "ceiling", or upper bound on
the number of iterations for the loop. It then
calculates its inloop (initial value of the loop)
and endloop (final value of the loop) values.
Hence, different nodes will have different inloop
and endloop values. After each do loop is
completed in the node, its results (i.e. the left
hand side values) and the loop control values are
sent back to the host. When the host receives
these values, it loads them into their
corresponding destinations.
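
The text does not give the formula each node uses for its bounds. The following Python sketch shows one plausible reading, under the assumption that iterations are split into contiguous blocks of size ceiling = ceil(iterations / nodes); the function name and parameters are hypothetical.

```python
def loop_bounds(first: int, last: int, nnodes: int, node: int):
    """Return (inloop, endloop) for one node under a block partition.

    Assumes a unit-stride loop DO I = first, last and nodes numbered
    0 .. nnodes-1. A node with inloop > endloop has no iterations to run.
    """
    total = last - first + 1
    ceiling = -(-total // nnodes)            # upper bound of iterations per node
    inloop = first + node * ceiling
    endloop = min(last, inloop + ceiling - 1)
    return inloop, endloop
```

With 100 iterations on 16 nodes the ceiling is 7, so the last node receives no work. A more even split is possible, but this form keeps the per-node computation to a few integer operations.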

PROGRAM  header, 
buffer  declaration. 

normal  array  and  variable  declaration, 
special  constant  declaration, 
special  variable  declaration, 
equivalence-statements  for  control  variables 

and  buffers. 

data-statement  for  initializing  special  constants. 

CALL SETPID(HOSTPID).

NNODES = NUMNODES().

CALL LOAD('node', ALLNODES, NODEPID).

OPEN  data  files, 
assign  control  variables. 

compute  message  length  (LENOUT)  from  control  variables, 
load  output  buffer  (BUFFOUT)  with  data  from  right 
hand  side  of  do-loop. 

CALL  CSEND(TYPEOUT,  BUFFOUT,  LENOUT, 

ALLNODES,  NODEPID). 

compute  upper  bound  length  (LENIN)  of  incoming 
message  from  each  node. 

DO 768 INODE = 1, NNODES.

CALL  CRECV(TYPEIN,  BUFFIN,  LENIN), 
move  message  to  destination  array. 

768  CONTINUE, 
rest  of  the  code. 

CALL  KILLCUBE(ALLNODES,  NODEPID). 

CLOSE  data  files. 

STOP. 

END. 

Figure 8. Host Program Outline


5.  IMPLEMENTATION 

The precompiler was developed in FORTRAN
under VMS on a VAX 8800 and then ported to the
iPSC/2. This was done because, compared with the
VAX 8800, the SRM is single user and slower for
program development. A complete set of examples
and the source code for the Lexical Analyzer,
DDD, and Code Generator can be found in
[Gonz-88].

6.  CONCLUSIONS  AND  FUTURE  RESEARCH 

Many programs have been written in sequential
FORTRAN. A precompiler that generates source code
in parallel form allows most of these "old"
FORTRAN programs to run on a supercomputer
without being redesigned and rewritten. Some key
factors which complicate this project are the
number of nested do loop levels and any operation
done with the indexes of the arrays.

We are working on a tutorial aid to guide
FORTRAN programmers in using our FORTRAN
precompiler to generate concurrent programs. We plan
to work on a model that supports stochastic loop
assignment. This model will require flexible
formats and node to node communication. Further
research will consider complex statements,
such as equivalence- and if-statements. A final
direction is the implementation of the MIMD
model, in which the program is divided into
segments (either subroutines or functions),
each of which is downloaded to a different node
and run with its own data. We will include
in all of our future models performance
measurement and comparison with other models
(hardware and software).

MULTIPLY TWISTED N-CUBES FOR PARALLEL COMPUTING


T.-H.  Shiau,  Paul  Blackwell  and  Kemal  Efe.  University  of  Missouri-Columbia 


Abstract. It is known that by twisting one pair of edges
of the N dimensional cube, the resulting graph,
denoted by TQ(N), has diameter N-1 instead of N. In
this work, we show that by twisting multiple pairs of
edges as well as pairs of buses (a bus is defined as a
set of edges with certain common properties), the
diameter becomes $\lceil 2N/3 \rceil$. The resulting multiply
twisted N-cube, denoted by MTQ(N), preserves most
of the desirable topological properties of the ordinary
N-cube for parallel computing. A simple routing
method is presented which can easily be
implemented. Finally we discuss generalizations of
MTQ(N) for which the diameters can be made even
smaller at the expense of more complicated routing.
The smallest diameter which can be achieved by this
approach is $\lceil (N+1)/2 \rceil$.

KEY WORDS: Interconnection networks, Hypercube,
Parallel processing

1.  INTRODUCTION 

An n-dimensional hypercube Q(n) = (V,E) is the
graph with $N = 2^n$ nodes, each of which can be labeled
by a unique n-bit binary number such that two nodes
are adjacent if and only if their labels differ in exactly
one bit position. The graph Q(3) is depicted in Figure
1.

Many multiprocessor computer systems use the
hypercube as the interconnection network, i.e. each
node of Q(n) is a processing element, usually with
local memory, and the edges of Q(n) are the physical
communication links. For example, the Cosmic Cube
in Seitz (1985), iPSC of Intel Corporation (1985),
NCUBE/10 of NCUBE Corporation (1986) and the
Connection Machine in Hillis (1985) are all
hypercube parallel computers, although the scales
and granularities of parallelism of those computers
vary widely.

Figure 1. Q(3) drawn in two different ways

The  popularity  of  the  hypercube  for 
interconnection  networks  stems  from  many  of  its  nice 
topological properties. To name a few, the graph is
regular, that is, each node has the same number n of
adjacent nodes, has relatively small diameter which
grows  only  logarithmically  with  respect  to  the  total 
number  of  nodes,  and  has  large  minimum-bisection 
width  (MBW)  N/2.  The  MBW  is  the  minimum  number 
of  edges  which  must  be  removed  from  the  graph  to 
separate  it  into  two  disconnected  graphs  with  equal 
numbers  of  nodes  (or  different  by  1  if  the  total  number 
is  odd).  Small  MBW  implies  severe  limitations  of 
parallel  data  routing  between  two  parts  of  the  system, 
while  large  diameter  would  mean  large  propagation 
delay  in  communication.  Other  properties  of  Q(n)  can 
be  found  in  Erdos  and  Spencer  (1979),  Folds  (1977), 
Hart  (1976),  Mulder  (1980),  and  Saad  and  Schultz 
(1985). 

Although Q(n) has many desirable properties, it is
shown in Esfahanian, Ni and Sagan (1987) that the
diameter can be reduced by 1 by twisting any single
pair of edges in any shortest cycle. For example,
Figure 2 shows the twisted cube with diameter 2. The
twisted n-cube, denoted by TQ(n), preserves most of
the nice properties of Q(n). In addition, it contains the
$(2^n - 1)$-node complete binary tree as a subgraph, which
is not a subgraph of Q(n). Independently, Blackwell
et al. (1988) show that by properly twisting pairs of
bundles of edges as a whole and pairs of edges
within the bundles, the diameter can be reduced to
$\lceil (n+1)/2 \rceil$ and most of those properties are still
retained. This reduces the diameter of Q(n) by almost
fifty percent.

Figure 2. TQ(3) drawn in two different ways

Although the fifty percent reduction of the diameter
provides the potential for the same amount of
reduction of the propagation delay of interprocessor
communications, the routing is more complicated,
which would offset some of the advantages in
practical applications. In this work, we show a much
simpler way to construct a class of twisted
hypercubes with diameter $\lceil 2n/3 \rceil$ for which a simple
routing method exists.

In general, we can construct twisted cubes with
diameter $\lceil (2+k)n/(3+2k) \rceil$. The greater the k, the
more complicated the graph and the routing, and the
closer the diameter is to the $\lceil (n+1)/2 \rceil$ of the graph
in Blackwell et al. (1988).


Figure 3. MTQ(3k+3)


2  DEFINITION  OF  THE  MULTIPLY  TWISTED 
N-CUBES 

Definition: A recursive definition is given as follows
for multiply twisted n-cubes, MTQ(n), with diameter
$\lceil 2n/3 \rceil$.

0. MTQ(0) = Q(0), which consists of a single node.

1. MTQ(3k+1), for $k \ge 0$, consists of two copies of
MTQ(3k), $G_0$ and $G_1$, and $2^{3k}$ ($= |G_0|$) additional
edges, called level 3k links, between $G_0$ and $G_1$,
which define an isomorphism $f_{3k}: G_0 \to G_1$ by
$v_1 = f_{3k}(v_0)$ if and only if $(v_0, v_1)$ is a level 3k link. In
short, MTQ(3k+1) is constructed by linking two copies of
MTQ(3k) by a "straight bus" of $2^{3k}$ lines. The straight
bus, in contrast to the twisted bus of Blackwell et al.
(1988), makes the topology and routing very simple.

2. MTQ(3k+2) is similarly defined by two copies of
MTQ(3k+1) and a straight bus of $2^{3k+1}$ level (3k+1)
links (edges).

3. MTQ(3(k+1)) consists of two copies of MTQ(3k+2)
and a twisted bus of level (3k+2) links between them
such that the eight copies of MTQ(3k) form a TQ(3);
see Figure 3. More specifically, MTQ(3k+3) = TQ(3) $\times$
MTQ(3k).
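
The recursive definition can be checked numerically for small n. The Python sketch below is an illustration under stated assumptions, not the authors' construction verbatim: it reads the product in step 3 as a Cartesian product of graphs, groups label bits in threes starting from the least significant bit, and builds TQ(3) by twisting one arbitrarily chosen pair of opposite edges of a 4-cycle of Q(3). All function names are hypothetical.

```python
from collections import deque


def tq3_adjacency():
    """Q(3) with edges (000,001) and (010,011) replaced by (000,011) and (010,001)."""
    adj = {v: {v ^ 1, v ^ 2, v ^ 4} for v in range(8)}
    for a, b in ((0, 1), (2, 3)):        # remove two opposite edges of a 4-cycle
        adj[a].discard(b)
        adj[b].discard(a)
    for a, b in ((0, 3), (2, 1)):        # reconnect them crosswise (the twist)
        adj[a].add(b)
        adj[b].add(a)
    return adj


TQ3 = tq3_adjacency()


def mtq_neighbors(v, n):
    """Neighbors of node v in MTQ(n): TQ(3) on each full 3-bit group, straight links above."""
    neighbors = []
    full = n - n % 3                     # bits covered by complete TQ(3) modules
    for shift in range(0, full, 3):
        group = (v >> shift) & 7
        rest = v & ~(7 << shift)
        neighbors.extend(rest | (g << shift) for g in TQ3[group])
    for bit in range(full, n):           # remaining one or two straight buses
        neighbors.append(v ^ (1 << bit))
    return neighbors


def diameter(n):
    """Largest shortest-path distance in MTQ(n), by breadth-first search from every node."""
    worst = 0
    for start in range(1 << n):
        dist = {start: 0}
        queue = deque([start])
        while queue:
            x = queue.popleft()
            for y in mtq_neighbors(x, n):
                if y not in dist:
                    dist[y] = dist[x] + 1
                    queue.append(y)
        worst = max(worst, max(dist.values()))
    return worst
```

Every node keeps degree n, since each TQ(3) module is 3-regular and each straight bus contributes one link.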

3. THE DIAMETER AND ROUTING

Theorem 1. The diameter of MTQ(n) is $\lceil 2n/3 \rceil$. The
diameters of MTQ(3k+2) and MTQ(3k+3) are both
2k+2.

Proof: Induction on n.

Case 1. n = 3k+1. Let $G_0$ and $G_1$ be the two copies
of MTQ(3k) in MTQ(n). Given any two nodes u, v in
MTQ(n), if they belong to the same copy of MTQ(3k),
$d(u,v) \le \lceil (2/3)\,3k \rceil = 2k$ by the induction hypothesis.

Otherwise, assume $u \in G_0$, $v \in G_1$ and let
$u' = f_{3k}(u) \in G_1$, where $f_{3k}$ is the isomorphism. Then
$d(u,v) \le d(u,u') + d(u',v) \le 1 + 2k$.

So the diameter of MTQ(n) is at most $\lceil 2n/3 \rceil$. To show
that equality holds, let again $u \in G_0$, $v \in G_1$, but
also $d_{G_1}(u',v) = 2k$. Let $m = d(u,v)$; it suffices to show
$m = 2k+1$. Let $P = (u = s_0, s_1, \ldots, s_m = v)$ be any
shortest path between them. There must be a
level 3k link $(s_i, s_{i+1})$ for some i, $0 \le i < m$. By
isomorphism,
$P' = (u' = s_0', s_1', \ldots, s_i' = s_{i+1}, s_{i+2}, \ldots, s_m = v)$,
where $s_j' = f_{3k}(s_j)$, is a shortest path with length m-1
between u' and v. So 2k = m-1.

Case 2. n = 3k+2. Similar to the previous case, we
can show that diameter(MTQ(n)) = 2k+2 = $\lceil 2n/3 \rceil$.

Case 3. n = 3k+3. Because of the twisting of the
two buses, we can go from any copy of MTQ(3k) to
another by no more than two links. So
diameter(MTQ(3k+3)) $\le$ 2 + diameter(MTQ(3k)) = 2 +
2k. By a similar argument as in Case 1, the equality
holds again. Q.E.D.

4.  THE  ROUTING  METHOD 

Routing on TQ(3) is straightforward since it
consists of only eight nodes. The routing algorithm
can either be hard-wired directly or implemented by a
lookup table of eight entries showing the outgoing link
for each destination. The routing of MTQ(n) is simply a
multi-level TQ(3) routing. Using the n-bit binary
number labeling with the less significant bits for lower
level links, we can carry out the routing either
bottom-up or top-down. In bottom-up routing, we
correct the less significant bits first, i.e. we route the
package to the intermediate node whose label
has the same less significant bits as that of the
destination. Note that every three bits can be
corrected in two steps by applying the same lookup
table as that for TQ(3). So after $\lceil 2n/3 \rceil$ steps all the
bits are correct. The top-down routing corrects the
most significant bits first. The details are omitted.
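
The bottom-up scheme can be sketched in a few lines. The Python code below is a hedged illustration, not the authors' implementation: it assumes the same labeling as a product of TQ(3) modules on 3-bit groups (least significant group first), uses one arbitrarily chosen twist to define TQ(3), and derives the lookup table by breadth-first search. The names `NEXT_HOP` and `route_bottom_up` are hypothetical.

```python
from collections import deque


def tq3_adjacency():
    """Q(3) with edges (000,001) and (010,011) replaced by (000,011) and (010,001)."""
    adj = {v: {v ^ 1, v ^ 2, v ^ 4} for v in range(8)}
    for a, b in ((0, 1), (2, 3)):
        adj[a].discard(b)
        adj[b].discard(a)
    for a, b in ((0, 3), (2, 1)):
        adj[a].add(b)
        adj[b].add(a)
    return adj


def build_next_hop(adj):
    """Lookup table: (current 3-bit label, destination 3-bit label) -> next label."""
    table = {}
    for dst in range(8):
        dist = {dst: 0}
        queue = deque([dst])
        while queue:
            x = queue.popleft()
            for y in adj[x]:
                if y not in dist:
                    dist[y] = dist[x] + 1
                    queue.append(y)
        for src in range(8):
            if src != dst:
                # Any neighbor one step closer to dst is acceptable; take the smallest.
                table[src, dst] = min(y for y in adj[src] if dist[y] == dist[src] - 1)
    return table


TQ3 = tq3_adjacency()
NEXT_HOP = build_next_hop(TQ3)


def route_bottom_up(src, dst, n):
    """Return the list of nodes visited when correcting low-order groups first."""
    path = [src]
    cur = src
    full = n - n % 3
    for shift in range(0, full, 3):              # each full group: at most two hops
        while (cur >> shift) & 7 != (dst >> shift) & 7:
            g = NEXT_HOP[(cur >> shift) & 7, (dst >> shift) & 7]
            cur = (cur & ~(7 << shift)) | (g << shift)
            path.append(cur)
    for bit in range(full, n):                   # leftover bits use straight links
        if ((cur ^ dst) >> bit) & 1:
            cur ^= 1 << bit
            path.append(cur)
    return path
```

Each node only needs the eight entries of the table that belong to its own 3-bit label in the active group, which matches the eight-entry table described above.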

5.  GENERALIZATION 

Using basic modules other than TQ(3), we can
construct different multiply twisted hypercubes with
even smaller diameters. In Blackwell et al. (1988), a
family of twisted cubes is given with diameter
$\lceil (n+1)/2 \rceil$ where n is the dimension. For n=3, the
graph is the same as TQ(n). To make this paper more
self-contained, we shall describe the case for n=5
and thereby construct the multiply twisted hypercubes
using it as the basic module.

We shall use the same notation TQ(n) for the new
family of twisted cubes.

Definition:

(1) For n=2, TQ(n) is the same as Q(n).

(2) For n=3, TQ(n) is as in Figure 2.

(3) For n=4, TQ(4) is constructed from four copies of
TQ(2), denoted by $G_{00}$, $G_{01}$, $G_{10}$ and $G_{11}$,
connected by four buses of width four as in Figure 4.
Each bus is twisted so that the bus and the two
copies of TQ(2) at its ends form a TQ(3).

(4) For n=5, TQ(5) is constructed from eight copies of
TQ(2), $G_{b_2 b_1 b_0}$ with $b_i \in \{0, 1\}$ for $0 \le i \le 2$,
connected by twelve twisted buses of width four as shown in
Figure 5. Again, each bus is twisted so that the two
copies of TQ(2) and the bus form a TQ(3).

Figure 4. TQ(4) with diameter 3

Figure 5. TQ(5) with diameter 3

Remarks. The definition can be extended to
arbitrarily large n. The resulting graph TQ(n) retains
most of the nice topological properties such as
regularity and strong connectivity, but with diameter
only $\lceil (n+1)/2 \rceil$ and a more complicated routing
algorithm. The details are in Blackwell et al. (1988).

Using TQ(5) as the basic module, which has diameter
3, we can construct MTQ(n) similarly to Section 2, so
that diameter MTQ(n) = $\lceil 3n/5 \rceil$.

The formal definition is given as follows.

Definition.

0. MTQ(0) is a single node.

1. MTQ(5k+i) = TQ(i) $\times$ MTQ(5k), for i = 1, 2, ..., 5.

The routing is done by correcting 5 consecutive bits
as a group by 3 links. The same routing algorithm for
TQ(5) is used at each node after the active 5 bits are
selected. Note that a lookup table for TQ(5) has 32
entries instead of 8 as in Section III, in which the
diameter is larger.

By choosing yet bigger but more compact basic
modules from Blackwell et al. (1988), we can define
more compact MTQ with more complicated routing
algorithms (or bigger lookup tables).

6.  CONCLUSION 

We  show  that  by  constructing  the  twisted  cube 
hierarchically,  one  can  reduce  the  diameter  of  Q(n) 
by a constant factor, such as 2/3, and still keep the
routing  very  simple.  Theoretically,  the  constant  factor 
can  be  made  arbitrarily  close  to  1/2,  although  the 
additional  complication  of  routing  may  make  it 
undesirable. 
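
To make the trade-off concrete, the diameters quoted in this paper can be tabulated for a given dimension. The short Python sketch below only evaluates the closed-form expressions stated above; the function name and dictionary keys are hypothetical labels.

```python
def ceil_div(a: int, b: int) -> int:
    """Integer ceiling of a / b for positive b."""
    return -(-a // b)


def diameters(n: int) -> dict:
    """Diameters of the n-dimensional networks discussed in the text."""
    return {
        "Q(n)": n,
        "MTQ(n) from TQ(3) modules": ceil_div(2 * n, 3),
        "MTQ(n) from TQ(5) modules": ceil_div(3 * n, 5),
        "TQ(n) of Blackwell et al.": ceil_div(n + 1, 2),
    }
```

For n = 15 the four values are 15, 10, 9 and 8, which shows how the constant factor moves from 2/3 toward 1/2 as the basic module grows.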

It is interesting to note that in any hypercube
machine such as the Connection Machine, where the
routing is done in parallel by correcting 1 bit at a time
on the hypercube, if we reconnect the physical links
to make it an MTQ(n) such that the bit positions which
are corrected earlier correspond to higher level links,
then the computer would work as usual without
modifying the routing. The resulting routing algorithm
would be able to route $2^n$ packages in parallel in n
steps, assuming no contention, on a twisted cube
with diameter $\lceil 2n/3 \rceil$.

All-Subsets Regression on a Hypercube Multiprocessor
Peter  Wollan,  Michigan  Technological  University 


Introduction.  Parallel  multiprocessor  computers 
have  been  hailed  as  the  next  dramatic  improvement 
in  computing  power.  The  object  of  this  paper  is  to 
explore  the  use  of  one  type  of  parallel  computer  (a 
distributed-memory  system)  in  data  analysis. 
All-subsets  regression  was  chosen  as  a  suitable 
vehicle  with  which  to  gain  experience:  it  is  a 
data-analysis  procedure  that  is  implemented  in  most 
statistical  packages,  yet  it  requires  enough 
computation  that  standard  mainframe  computers  may 
not  provide  enough  power  even  for  reasonably  small 
problems;  moreover,  it  appeared,  in  advance,  to  be 
inherently  parallelizable. 

Parallel  computers  come  in  essentially  two 
varieties:  shared  memory,  in  which  each  processor 
has  access  to  all  of  a  common  memory,  and 
distributed  memory,  in  which  each  processor  has  its 
own  separate  memory  and  sends  messages  to  the 
other  processors.  In  both  cases,  the  goal  has  been  to 
provide  greater  computational  speed  by  dividing  a 
problem  into  pieces  which  can  be  computed 
simultaneously,  and  then  recombined  into  a  solution. 
The  shared  memory  systems  have  been  found  to  be 
difficult  to  implement,  both  in  hardware  and  system 
software,  but  once  implemented  they  can  be  used,  at 
least  at  some  level,  with  comparative  ease.  Most 
users  of  Cray  systems,  for  example,  treat  the  system 
as  if  it  were  a  single  processor,  and  the  effect  of 
having  several  processors  is  to  increase  the  number 
of  users  that  can  be  serviced.  Distributed  memory 
systems,  on  the  other  hand,  are  comparatively  easy 
and  cheap  to  build,  but  require  the  user  to  explicitly 
parcel  out  computations  to  the  separate  processors, 
and  to  explicitly  send  messages  among  them  to  keep 
the  computations  coordinated. 

The  particular  machine  used  here  was  an  Intel 
iPSC-d4,  which  is  a  16-processor  hypercube. 
"Hypercube"  refers  to  the  communication  links  among 
the  processors.  Because  of  hardware  limitations,  it 
is  not  possible  to  connect  every  processor  with  every 
other.  Several  communication  patterns  have  been 
tried,  and  have  acquired  names;  the  hypercube 
architecture  seems  to  be  the  most  common  at  this 
point. It can be thought of as placing $2^d$ processors
at the vertices of a d-dimensional cube, with
communication links provided only along the edges of
the cube. Hence, each processor can directly
communicate with d other processors, and messages
to any other processor must be relayed. If the
vertices are denoted by d-coordinate vectors of 0's
and 1's, then a communication link exists between
two vertices if the corresponding vectors differ in
only one coordinate. Often, the vectors are thought
of as d-bit binary integers, so that processor number
8 in a 4-dimensional hypercube, for example, is at
vertex (1,0,0,0), and its neighbors, with which it can
directly communicate, are processors 9, 10, 12, and
0.
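
The neighbor rule amounts to flipping one bit of the processor number. A minimal Python sketch (the function name is hypothetical) reproduces the example above:

```python
def hypercube_neighbors(node: int, d: int) -> list:
    """Processors directly linked to `node` in a d-dimensional hypercube."""
    return [node ^ (1 << bit) for bit in range(d)]


print(hypercube_neighbors(8, 4))  # [9, 10, 12, 0]
```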

The  Intel  iPSC  has  an  additional  processor,  with 
its  own  memory,  called  the  host.  The  host  can 
communicate  directly  with  all  the  other  processors, 
which  are  called  nodes.  While  communication 
between  nodes  is  relatively  fast,  communication 
between  host  and  nodes  is  relatively  slow,  and 
communication  from  the  user  to  a  node  must  pass 
through  the  host. 

As  noted  above,  it  is  necessary  for  the  user  to 
explicitly  apportion  computations  among  the  nodes. 
Ideally,  this  will  be  done  in  such  a  way  that  all 
processors  are  kept  busy  the  same  amount  of  time, 
and  so  that  no  processor  is  forced  to  wait  for 
another  to  complete  an  intermediate  result.  The 
procedure  described  here  is  not  optimal,  but  is 
reasonably  close,  and  uses  the  parallel  nature  of  the 
machine  in  an  acceptably  efficient  way. 

Regressions  were  computed  with  the  Sweep 
algorithm,  which  is  well-known  and  widely  used 
(see,  for  example,  Weisberg  (1987),  p  60).  Any 
implementation  of  the  Sweep  must  use  some  method 
of  checking  for  collinearity,  if  only  to  avoid  dividing 
by  zero.  Largely  out  of  curiosity,  the  method  chosen 
was  that  proposed  by  Berk  (1977);  the  behavior  of 
this  portion  of  the  program  turned  out  to  be,  in  many 
ways,  more  interesting  than  the  parallel  part. 

Section  2  describes  the  program,  and  gives  some 
details  about  its  components.  Section  3  describes 
its  performance,  and  includes  a  comparison  with  SAS 
Proc  Rsquare.  Section  4  concludes  with  some 
comments  about  using  distributed  memory  computers 
for  all-subsets  regression,  and  some  more  tentative 
comments  about  using  them  for  data  analysis  in 
general. 

2. The Algorithm. All-subsets regression is often
used to find the best set of predictor variables for a
regression model. Given k predictors and a response,
the linear regression is computed for each subset of
predictors, and the best model is chosen by some
criterion. Since there are $2^k - 1$ non-trivial subsets,
the number of models to be computed is very large
even for reasonably small data sets.

The  Sweep  is  an  in-place  matrix  inversion 
algorithm:  starting  from  a  correlation  matrix,  it 
produces  both  standardized  regression  coefficients 
and $(1 - R^2)$. It is attractive for regression for a
number  of  reasons:  it  is  easy  to  program,  uses 
relatively  little  memory,  and  is  numerically 
reasonably  stable.  It  has  two  other  features  that  are 
important  for  certain  procedures  such  as  stepwise 
regression  and  all-subsets  regression:  sweeping  on  a 
set  of  pivots  produces  the  same  result,  no  matter 
what  order  is  used;  and  sweeping  on  a  pivot  a  second 
time  has  the  effect  of  deleting  the  corresponding 
variable  from  the  model.  Consequently,  a  predictor 
can  be  introduced  into  a  model  or  deleted  from  a 
model  in  essentially  the  same  amount  of  time.  This 
feature  allows  the  following  approach  to  all-subsets 
regression: let the $2^k$ regression models correspond
to vertices of a k-dimensional hypercube, where a
model is described by a vector of 0's and 1's, with 1
in  the  ith  coordinate  indicating  the  ith  predictor  is 
present  in  the  model.  Then,  the  Sweep  allows  moving 
from  one  model  to  another  along  an  edge  of  the 
hypercube,  and  a  sequence  of  models  determines  a 
path  along  edges. 
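
The two properties of the Sweep that the approach relies on can be demonstrated on a small correlation matrix. The Python sketch below uses one common formulation of the operator (pivot row divided by the pivot, pivot column divided by the negated pivot); the paper does not state which variant its program uses, so this choice, the function name, and the example matrix are assumptions. It returns a copy rather than working in place, to keep the example easy to follow.

```python
def sweep(a, k):
    """Sweep the square matrix `a` (list of lists) on pivot k and return a new matrix."""
    d = a[k][k]
    if d == 0:
        raise ZeroDivisionError("zero pivot: the predictor is collinear")
    n = len(a)
    b = [row[:] for row in a]
    for i in range(n):
        for j in range(n):
            if i == k and j == k:
                b[i][j] = 1.0 / d
            elif i == k:
                b[i][j] = a[k][j] / d              # pivot row
            elif j == k:
                b[i][j] = -a[i][k] / d             # pivot column
            else:
                b[i][j] = a[i][j] - a[i][k] * a[k][j] / d
    return b


# Correlations of two predictors (rows 0 and 1) and the response (row 2).
R = [[1.0, 0.5, 0.6],
     [0.5, 1.0, 0.4],
     [0.6, 0.4, 1.0]]

one_predictor = sweep(R, 0)
# one_predictor[0][2] is the standardized coefficient of predictor 0,
# and one_predictor[2][2] is 1 - R^2 for the model containing only predictor 0.
```

Sweeping `one_predictor` on pivot 0 again returns the original matrix up to rounding, and sweeping on pivots 0 and 1 in either order gives the same two-predictor model.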

Using  one  processor,  an  efficient  sequence  of 
models  corresponds  to  a  path  that  passes  through 
each  vertex  of  the  k-dimensional  hypercube  and  does 
not  pass  through  any  vertex  twice;  if  we  add  the 
requirement  that  the  last  model  be  one  step  away 
from  the  first  (null)  model,  the  path  is  a  Hamiltonian 
circuit.  There  are  many  Hamiltonian  circuits  for  the 
hypercube.  One  is  given  by  the  well-known  Gray  code 
(see, for example, Kohavi, 1978, p. 13), which allows
computing the i-th vertex of the path from the binary
representation  of  the  integer  i,  using  simple  binary 
arithmetic. 
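
The binary-reflected Gray code is easy to state in code. In the Python sketch below the function names are hypothetical: `gray(i)` gives the i-th model as a bit mask of predictors, and `pivot_sequence(k)` lists which predictor must be swept to move from each model to the next.

```python
def gray(i: int) -> int:
    """i-th codeword of the binary-reflected Gray code."""
    return i ^ (i >> 1)


def pivot_sequence(k: int) -> list:
    """Index of the single predictor that changes between consecutive models."""
    return [(gray(i - 1) ^ gray(i)).bit_length() - 1 for i in range(1, 1 << k)]


print([gray(i) for i in range(8)])   # [0, 1, 3, 2, 6, 7, 5, 4]
print(pivot_sequence(3))             # [0, 1, 0, 2, 0, 1, 0]
```

Because consecutive codewords differ in exactly one bit, each new model costs a single sweep, and the last codeword is one step away from the null model.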

Using $2^d$ processors, an optimal sequence of
models corresponds to a set of $2^d$ paths, passing
through  every  vertex,  all  starting  at  the  origin 
(corresponding  to  the  null  model),  which  do  not  cross 
each  other,  and  which  are  all  nearly  the  same  length 
(since  they  all  start  at  the  origin,  they  can't  be 
exactly  the  same  length).  For  certain  k  and  d,  such 
sets  exist;  however,  there  seems  to  be  no  way  to 
extend  solutions  for  small  cubes  to  larger  ones,  and 
for  some  k  and  d  there  may  be  no  solution. 

However, a nearly optimal set of paths can be
obtained from the Gray code mapping of the
k-dimensional hypercube into the $2^d \times 2^{k-d}$ torus,
as follows: a vertex of the cube is mapped onto a
point in a $2^d \times 2^{k-d}$ rectangular grid. The row is
determined by applying the Gray code to the first d
coordinates, and the column by applying the Gray
code to the remaining k-d coordinates. (The grid is
a torus in the sense that the Gray code "wraps
around" in both rows and columns.) Each processor,
then, is assigned a row of the grid. In order to get
to  the  beginning  of  its  row,  the  processor  must 
introduce  some  variables,  so  there  will  be  some 
duplication  of  effort  among  the  processors;  but  no 
more  than  d  variables  need  be  introduced,  so  that 
the  longest  path  is  only  d  steps  longer  than  the 
shortest. 
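
The row assignment can be sketched as follows. This Python illustration makes two assumptions that the text leaves open: the "first d coordinates" are taken to be the d high-order bits of the model's bit mask, and processor p is given the row whose label is the Gray code of p. The function names are hypothetical.

```python
def gray(i: int) -> int:
    """i-th codeword of the binary-reflected Gray code."""
    return i ^ (i >> 1)


def processor_path(p: int, k: int, d: int):
    """Return (setup, models) for processor p out of 2**d, with k predictors.

    setup  : predictors that must be introduced to reach the start of the row
    models : the 2**(k-d) models on the row, in Gray code order
    """
    row = gray(p) << (k - d)                       # row label in the high d bits
    setup = [bit for bit in range(k - d, k) if (row >> bit) & 1]
    models = [row | gray(c) for c in range(1 << (k - d))]
    return setup, models
```

With k = 6 and d = 2, each of the four processors computes 16 models after at most two setup sweeps, and together they cover all 64 subsets without overlap.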

The  parallelization  of  all-subsets  regression, 
then,  can  be  described  as  follows:  Each  node  is 
provided  with  the  correlation  matrix  of  the  data.  It 
introduces  a  set  of  predictors,  to  get  to  its  row  of 
the  torus,  then  computes  the  models  on  its  row, 
saving  appropriate  statistics  from  each  model. 

When  the  node  is  done,  it  sends  the  collected 
statistics  to  the  host  for  output.  Communication 
among  nodes  is  involved  only  at  the  beginning,  when 
the  data  is  being  passed  out,  and  at  the  end,  when 
results  are  collected.  The  program  uses  recursive 
doubling  to  broadcast  the  correlation  matrix  to  the 
nodes:  the  host  sends  the  matrix  to  one  node.  The 
node  sends  the  matrix  to  another;  both  nodes  then 
send  it  to  others,  and  so  on,  with  the  number  of 
nodes  receiving  the  matrix  doubling  at  each  step. 
This  procedure  uses  both  the  inter-node 
communication  links,  which  are  very  fast,  and  the 
parallel  communication  features  of  the  hypercube 
architecture.  Recursive  halving  could  have  been 
used  to  collect  the  results  in  one  node  for  output, 
but  this  was  found  to  be  inefficient:  the  amount  of 
output  was  fairly  large,  and  there  is  a  limit  on  the 
size  of  each  message.  Collecting  all  the  output  in 
one  node,  and  sending  it  in  smaller  parcels  to  the 
host,  was  less  efficient  than  simply  having  each 
node  send  its  results  to  the  host  directly. 
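
As a rough illustration of recursive doubling, the sketch below lists which node sends to which at each step on a d-dimensional hypercube. It assumes the host has already sent the matrix to node 0 and that nodes are labelled by d-bit integers; the function name is hypothetical and no real message-passing calls are shown.

```python
def recursive_doubling(d):
    """Broadcast schedule on a d-dimensional hypercube, starting at node 0.

    Returns a list of steps; each step is a list of (sender, receiver)
    pairs. At step s every node that already holds the data sends it to
    the neighbour whose label differs in bit s.
    """
    have = [0]
    steps = []
    for s in range(d):
        sends = [(node, node | (1 << s)) for node in have]
        steps.append(sends)
        have += [dst for _, dst in sends]
    return steps


for s, sends in enumerate(recursive_doubling(4)):
    print("step", s, sends)
```

The number of nodes holding the matrix doubles at each step, so 16 nodes are reached in 4 steps, and every message travels along a single hypercube link.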

The  computation  of  the  regression  models 
required  checking  for  singularity  of  the  matrix  at 
each  stage.  Berk  (1977)  proposed  a  procedure  in 
which  the  model  is  rejected  (the  predictor  is  not 
allowed  to  be  introduced)  if  the  trace  of  the 
submatrix  corresponding  to  the  model  is  greater 
than  the  tolerance  divided  by  p,  where  p  is  the 
number  of  predictors  introduced  and  the  tolerance 
is chosen by the user (here, 1000). This was
justified by an inequality involving the condition
number. It is quite different from, and seems to be
substantially more conservative than, the
procedures proposed by Stewart (1987) and Beaton,
Rubin, and Barone (1976). Implementing Berk's
procedure within the all-subsets regression
program required some bookkeeping; it was
necessary  to  keep  track  of  the  "official"  model, 
given  by  the  Gray  code,  and  also  the  "actual"  model, 
those  predictors  that  were  allowed  to  be 
introduced.  Moreover,  deleting  variables  often 
required  recomputing  the  model  from  scratch,  since 
a  predictor  that  had  been  refused  admittance 
earlier  might  be  allowable  in  the  smaller  model. 

The  program  was  written  in  Intel's  version  of 
FORTRAN  77,  which  has  a  number  of  extensions  to 
provide  for  communication  between  nodes.  Each 
manufacturer  has  chosen  its  own  set  of  extensions 
for  this  purpose,  and  Intel  even  changed  the  syntax 
substantially  when  it  released  the  second  version 
of  the  iPSC.  Consequently,  the  code  will  not  run  on 
any  other  machine,  and  is  not  reproduced  here;  it  is 
available  from  the  author  on  request.  The 
program's  structure  is  as  follows:  the  host 
program  reads  a  correlation  matrix  from  a  file, 
sends  it  to  one  node,  and  then  waits  for  output.  As 
it  receives  output  from  each  node,  it  writes  it  to  a 
file.  The  same  program  executes  on  each  node;  the 
program  can  ask  which  node  it  is  running  on,  and 
take  different  action  depending  on  the  answer.  The 
node  program  begins  by  receiving  the  correlation 
matrix,  then  sends  it  on  to  others,  using  recursive 
doubling.  The  main  program  computes  the  variable 
to  be  introduced  or  deleted  for  the  next  model, 
using  the  Gray  code  algorithm;  a  subroutine 
computes  the  Sweep,  and  another  subroutine 
recomputes  the  model  when  necessary.  As  models 
are computed, the program saves $R^2$ and two
numbers  describing  which  variables  have  been 
introduced;  when  all  models  have  been  computed, 
the  collected  results  are  sent  to  the  host. 
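
The core of the node program, a Gray code walk over the models with one sweep per step, might look like the sketch below. It is not the author's code: it uses one common symmetric form of the sweep operator, represents a model as a bit mask, and omits Berk's singularity check and the recomputation step entirely. All names are assumptions.

```python
def sweep(a, k, inverse=False):
    """Sweep (or reverse-sweep) the symmetric matrix a in place on pivot k."""
    d = a[k][k]
    n = len(a)
    sign = -1.0 if inverse else 1.0
    for i in range(n):
        for j in range(n):
            if i != k and j != k:
                a[i][j] -= a[i][k] * a[k][j] / d
    for i in range(n):
        if i != k:
            a[i][k] = sign * a[i][k] / d
            a[k][i] = sign * a[k][i] / d
    a[k][k] = -1.0 / d


def all_subsets_r2(corr):
    """corr is a (p+1) x (p+1) correlation matrix, response last.

    Returns a dict mapping each model bit mask to its R^2.
    """
    p = len(corr) - 1
    a = [row[:] for row in corr]
    results = {0: 0.0}
    model = 0
    for t in range(1, 2 ** p):
        # The Gray codes of t-1 and t differ in the lowest set bit of t.
        j = (t & -t).bit_length() - 1
        entering = ((model >> j) & 1) == 0
        sweep(a, j, inverse=not entering)
        model ^= 1 << j
        # After sweeping the predictors in the model, the response
        # diagonal holds 1 - R^2.
        results[model] = 1.0 - a[p][p]
    return results


corr = [[1.0, 0.2, 0.6],
        [0.2, 1.0, 0.5],
        [0.6, 0.5, 1.0]]
print(all_subsets_r2(corr))
```

Each new model costs one sweep, which is why the Gray code ordering is attractive; repeated sweeping and reverse sweeping does accumulate rounding error, which this sketch ignores.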

3.  Results.  The  parallelization  works  well.  The 
speedup factor is about 0.95 (that is, the time
required for 1 processor, divided by the time
required for n processors, is approximately 0.95n),
for problems in the range of 8 to 15 predictors and
up to 16 processors. In other words, a
16-processor machine takes slightly more than
1/16 of the time needed by a 1-processor machine
for  the  same  problem.  This  is  not  surprising,  since 
it  is  generally  communication  overhead  that 
matters  as  getting  the  output  in  readable  form  to  the 
user,  one  could  conclude  that  the  Intel  iPSC  offers 
computing  speed  roughly  comparable  to  an  IBM 
mainframe, and at substantially lower cost. In fact,
the time of 1.37 seconds is unfair to Intel; our
particular machine has been running 20 to 30 times
slower than it should be, probably because of
some  undiscovered  mis-specification  in  the 
installation  of  the  operating  system.  In  addition,  the 
new  version  of  the  machine  is  a  great  deal  faster. 

However,  SAS  (and  BMDP,  and  IMSL)  use  the 
Furnival-Wilson  Branch  and  Bound  algorithm  for 
all-subsets regression (see, for example, Hocking,
1976), which computes only the best models of each
size. For the best 5 models of each size, for the
10-predictor problem described above, SAS Proc
Rsquare required only 0.43 seconds.

Another feature of the program, Berk's
singularity check, behaves in an interesting way: in
effect, it gives a means of finding the best
acceptable model. Even though the output, in its
present form, displays only $R^2$ and two coded
integers describing the models, one can easily scan
the output and find those models which both have
high $R^2$ and pass the tolerance test. For example, for
the  Longley  data  (see,  for  example,  Beaton,  Rubin  and 
Barone,  1976)  one  can  quickly  see  that  the  best 
acceptable  three-variable  model  is  obtained  by 
fitting  the  variables  Unemployment,  Size  of  Armed 
Forces,  and  Year,  where  "best"  is  in  the  sense  of 
greatest $R^2$, and "acceptable" is in the sense of
passing Berk's tolerance test, with tolerance equal
to 1000. One also sees that adding a fourth variable,
Noninstitutional Population, yields a slightly higher
$R^2$ and still passes the test. It should be noted that
these  models  are  quite  different  from  the  ones 
Beaton,  Rubin,  and  Barone  suggested;  furthermore,  it 
is  difficult  or  impossible  to  obtain  qualitatively 
similar  results  from  SAS,  BMDP,  or  IMSL.  SAS  Proc 
Rsquare  apparently  does  not  check  for  collinearity  or 
tolerance in any way at all; IMSL subroutine RLEAP
does check for singularity, but the manual does not
describe what method is used; and BMDP-9R carries
out a tolerance check, but terminates when a model
fails the test.

4.  Conclusions.  All-subsets  regression  is  a  large 
enough  computing  problem  for  parallel  computers  to 
be  potentially  useful.  However,  the  experience 
gained  here  indicates  that  distributed-memory 
systems,  as  they  are  presently  designed,  have 
serious  shortcomings  which  make  their  use  for  this 
problem  doubtful  in  spite  of  their  speed. 

The  Furnival-Wilson  algorithm  is  clearly  the 
best  way  to  screen  a  large  number  of  models,  which 
is  generally  what  people  want  to  do  when  they  use  an 
all-subsets regression program. The advantage of
computing only the good models is already
substantial  for  10  predictors,  and  it  increases 
dramatically  as  the  problem  gets  larger.  It  may  be 
possible  to  code  this  algorithm  for  a 
distributed-memory  system,  but  it  is  not  at  all  clear 
how  to  do  it.  In  fact,  the  algorithms  that  have  been 
successfully  parallelized  for  these  systems  have 
tended  either  to  assign  distinct,  essentially 
independent  computations  to  each  processor,  as  was 
done  here,  or  to  implement  large  matrix  methods  and 
(roughly  speaking)  give  a  portion  of  the  matrix  to 
each  processor.  The  Furnival-Wilson  algorithm  is  of 
a  different  form  altogether:  its  efficiency  derives 
from  eliminating  potential  cases,  and  at  any  given 
time  there  is  not  a  great  deal  of  computing  to  be 
done. 

Computing every regression model, as is done
here, is not likely ever to be very attractive.
However, it does allow multiple, conflicting
screening criteria. In particular, Berk's tolerance
check  is  potentially  interesting  as  a  means  of 
diagnosing  and  handling  multicollinearity.  It  may  be 
possible  to  include  multiple  criteria  in  the 
Furnival-Wilson  algorithm;  this  would  achieve  the 
best  of  both  worlds. 

Regarding  the  general  use  of 
distributed-memory  systems  for  data  analysis, 
several limiting features have become apparent.
First, input and output are severely restricted. As
they  are  designed  now,  these  machines  are 
completely  inappropriate  for  such  input-bound 
problems  as  reduction  of  huge  data  sets.  There  is  not 
only  a  bottleneck  at  the  host,  there  is  a  limit  on  the 
size  of  messages  sent  between  nodes;  for  the 
all-subsets  regression  program  on  the  Intel  iPSC,  the 
nodes  become  unable  to  send  their  accumulated 
output  in  a  single  message  when  the  number  of 
predictors is only 15, even though the output
consists of only three numbers per model. (The new
version of the iPSC also has a limit, but somewhat
larger.) One could reasonably want to see a residual
plot, or a set of several diagnostic plots, for each
model;  that  is  not  practical  now. 

Second,  the  process  of  programming  the  system 
is  very  difficult.  To  some  extent,  this  is  due  to  the 
fact  that  the  machines  are  new,  and  programming 
tools  are  still  being  developed.  For  example,  the  new 
version  of  the  Intel  iPSC  comes  with  a  debugging 
package  that  represents  a  major  improvement. 
However,  programming  any  distributed  memory 
system  requires  using  some  elementary  programming 
structures that are very different from those taught
in traditional programming courses. One example is
the Gray code, needed to describe which nodes are
adjacent to which others; another is the recursive
doubling communication algorithm. These, and
others, are part of the basic language of the program,
just like arrays and DO loops. In addition, parallel
algorithms  require  a  substantially  different  way  of 
thinking  about  problems. 

Accumulated  experience  will  improve  both  the 
programming  tools  and  the  programmer's  knowledge 
and  skill,  but  there  appears  to  be  a  fairly  large  class 
of  problems  that  simply  aren't  suited  for 
distributed-memory  systems.  One  example  seems  to 
be  the  Furnival-Wilson  algorithm.  Another  is  the 
computation  of  a  correlation  matrix:  covariances  can 
be  computed  in  parallel  by  giving  each  node  a  set  of 
cases,  and  computing  partial  sums,  first  for  means 
and  then  for  cross  products,  and  exchanging  the 
partial  sums  among  the  nodes  so  that  each  node  ends 
up  with  the  full  covariance  matrix.  However,  the 
final  step,  going  from  covariances  to  correlations,  is 
difficult to parallelize efficiently.
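
The partial-sum scheme for covariances can be illustrated as below, with each "node" represented simply by a list of cases. This is a serial sketch of the arithmetic only; the function names are hypothetical, and the exchange of partial sums among nodes is replaced by ordinary Python sums.

```python
def partial_sums(cases):
    """Per-node pass 1: number of cases and column sums."""
    p = len(cases[0])
    return len(cases), [sum(row[j] for row in cases) for j in range(p)]


def partial_cross_products(cases, means):
    """Per-node pass 2: centred cross products, using the global means."""
    p = len(means)
    return [[sum((row[i] - means[i]) * (row[j] - means[j]) for row in cases)
             for j in range(p)] for i in range(p)]


def covariance(chunks):
    """Combine per-node partial sums into the full covariance matrix."""
    parts = [partial_sums(c) for c in chunks]
    n = sum(count for count, _ in parts)
    p = len(parts[0][1])
    means = [sum(s[j] for _, s in parts) / n for j in range(p)]
    cross = [partial_cross_products(c, means) for c in chunks]
    return [[sum(cp[i][j] for cp in cross) / (n - 1) for j in range(p)]
            for i in range(p)]


data = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 9.0]]
print(covariance([data[:1], data[1:]]))  # two "nodes" with unequal shares
```

The result does not depend on how the cases are divided among the nodes, since only sums are exchanged.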

Distributed  memory  parallel  computers  are  very 
fast  and  powerful;  but  programming  them  requires 
new  techniques  and  unfamiliar  tricks,  and  their  full 
power  may  be  usable  only  for  certain  kinds  of 
problems.  Overall,  they  appear  to  be  special-purpose 
machines, whose capabilities satisfy only some of
the needs of data analysis.

Testing Parallel Random Number Generators

Mark  J.  Durst,  Lawrence  Livermore  National  Laboratory 


As multiprocessor computers and networked computations
become more common, there is a need for
parallel pseudo-random number generation. This can
be thought of as the provision of many streams of
pseudo-random numbers, which should appear to be
independent within each stream and across streams.
Some ways of constructing tests for parallel random
number generators are discussed, along with the
computational limits on them. Experience using these
tests to construct parallel random number generators
for the Cray X-MP has failed to produce a
particularly powerful set of tests, and so the importance
of constructing computation-specific tests is stressed;
some guidance is offered for these constructions.

1  Introduction 

The use of multiprocessor computers and networks
of computers to solve serious problems with parallel
computations is becoming more common. Since these
computers are in general asynchronous, standard
system pseudo-random number generators (RNG's) are
insufficient, as they lack reproducibility: a
guarantee that different runs of a program will give the
same result. Non-reproducible runs should only vary
at the statistical level, and so production runs of
a simulation or Monte Carlo calculation can often
relinquish reproducibility and use standard system
RNG's. However, for debugging purposes and for
complex Monte Carlo calculations requiring intricate
traces (for instance, where one wishes to target
specific histories for future variance reduction), one must
be able to reproduce computations exactly, and so
parallel random number generators (PRNG's) are
required.

A PRNG can be viewed as a method for producing
multiple streams of pseudo-random numbers,
and there is experience with such methods; some
discussion occurs in Frederickson et al. (1981), and
Schruben and Margolin (1978, p. 507) comment:
"When two sets of pseudorandom number streams are
generated using different randomly selected vectors
of seeds, the two resulting time series samples... are
typically observed to be uncorrelated." Past work has
focused on providing a very small number of streams;
current computing demands many more. Relatively
inexpensive computers are now available with a
thousand processors, and the ability to create logical tasks
which do not necessarily correspond to physical
processors creates programs with a need for even more
streams; a production code at LLNL demands the
availability of seventy million (short) streams.

Without ad hoc modifications (see Durst (1988)),
the most promising techniques for parallel random
number generation involve splitting up the cyclic
stream of a given random number generator into
substreams. This provides substreams of sufficient
size for most current applications (particularly if one
splits the stream from a generalized feedback-shift
register or lagged-Fibonacci generator), but strains
the discrepancies of current RNG's. Few applications
of standard RNG's use the independence of more
than about a dozen dimensions; good discrepancies
at these dimensions are provided by the above
generators, as well as by large-modulus (48 bits and above)
congruentials. However, parallel computations may
require that some dozens of streams appear
independent in a dozen or so dimensions; the required
discrepancies in hundreds of dimensions are far beyond
the discrepancies of congruential RNG's (CRNG's),
and are at or beyond those for generalized feedback
shift register generators (GFSR's).

While conceding the theoretical shortcomings of
current methods, though, it should be pointed out
that many, perhaps most, Monte Carlo calculations
and simulations do not have sophisticated
independence requirements, and can even succeed by
appropriately splitting a CRNG. Empirical tests should be
used to verify minimally good properties of PRNG's,
to provide simplified paradigms of complex
calculations, and to check for the necessity and efficacy of
modifications to PRNG methods. While such tests
are well-known for standard RNG's (see, for
example, Knuth (1981) and Marsaglia (1985)) they
generally focus on testing a single stream; here tests
are required of the interdependence of many different
streams (it is assumed that standard RNG tests will
be used to check the quality of individual streams).
In this paper a few basic ways of constructing tests
are discussed, with some recommendations and
comments on computational constraints.

2  Some  PRNG  Tests 

Correlation tests are not very powerful, working only
with pairwise behavior and detecting only the most
serious dependencies. However, they are an
important class of tests, since (as will be seen) no other
testing regimen used here can verify much about
correlations at many lags. The desire with correlation
tests is to guarantee, for m streams, each of length n,
that correlations of lag k or less are under control. For
many applications, k can be on the order of one or
two dozen, while sensitive applications may require
that k be on the order of several hundred. There
are omnibus tests for the independence of
multivariate normals (see Anderson (1958), Chapter 9) which
can be used asymptotically. One can also test the
correlation coefficients with a Bonferroni test. If the
correlations are computed directly, the computation
time is $O(km^2 n)$, but Fourier techniques can bring
this down to $O(m^2 n \log n)$. These tests are most
effective at finding streams which are exact duplicates
or  antithetic  variates  at  relatively  small  lags. 
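
A direct, $O(km^2 n)$ version of such a correlation test with a Bonferroni limit might be sketched as follows. The function names, the default level, and the use of the large-sample normal approximation for a sample correlation under independence are assumptions of this example; the Fourier speed-up is not shown.

```python
import random
from statistics import NormalDist


def lagged_corr(x, y, lag):
    """Sample correlation between x[t] and y[t + lag], for lag >= 0."""
    n = len(x) - lag
    xs, ys = x[:n], y[lag:lag + n]
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    return sxy / (sxx * syy) ** 0.5, n


def correlation_test(streams, max_lag, alpha=0.01):
    """Return (i, j, lag, r) for each correlation beyond the Bonferroni limit."""
    m = len(streams)
    n_tests = m * (m - 1) * (max_lag + 1)  # ordered pairs, lags 0..max_lag
    z = NormalDist().inv_cdf(1 - alpha / (2 * n_tests))
    flagged = []
    for i in range(m):
        for j in range(m):
            if i == j:
                continue
            for lag in range(max_lag + 1):
                r, n = lagged_corr(streams[i], streams[j], lag)
                if abs(r) * n ** 0.5 > z:  # sqrt(n) * r is roughly N(0, 1)
                    flagged.append((i, j, lag, r))
    return flagged


rng = random.Random(1)
base = [rng.random() for _ in range(503)]
# Stream 2 is a straw man: a copy of stream 0 shifted by three positions.
streams = [base[:500], [rng.random() for _ in range(500)], base[3:]]
print(correlation_test(streams, max_lag=5))
```

A shifted duplicate like the one above is the kind of defect this test finds most reliably.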

Latitudinal tests are a general way of constructing
tests for PRNG's. Ordinary RNG tests split a
sequence longitudinally, with a short sequence
providing the numbers needed to compute one observation,
and adjacent short sequences used to provide repeat
observations. In a latitudinal test one number from
each of a fixed number of streams is used to provide
one observation; repeated observations are then
obtained by proceeding longitudinally. Of course,
latitudinal and longitudinal testing can be combined.
For instance, a four-dimensional equidistribution test
can be used to compare two longitudinal dimensions
of two streams. Latitudinal tests can be constructed
from any test which always generates an observation
from the same size set of numbers. It is not obvious
how to adapt other tests, such as gap tests and runs
tests, for latitudinal use. Tests for latitudinal use
usually should not depend on the order in which the
numbers appear; however, in some applications, there
may be a natural stream ordering, which should be
incorporated into tests. In small dimensions,
equidistribution tests on unit hypercubes can be used. For
somewhat larger dimensions, permutation and
partition tests can detect some bad deficiencies. For still
larger dimensions, most tests have been designed ad
hoc and have not been very useful in testing PRNG's,
but should probably be considered: examples are
collisions tests and tests based on transforming the
maximum of a number of uniforms to uniformity (Knuth
(1981), pp. 68-70), and the "Birthday Spacings" test
of  Marsaglia  (1985). 
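
A two-dimensional latitudinal equidistribution test can be sketched as below: one number is taken from each of two streams to form a point in the unit square, and the cell counts are compared with their expectation by a chi-square statistic. The function name and the convention that streams hold uniforms in [0, 1) are assumptions of the example.

```python
def latitudinal_chi2(stream_a, stream_b, bits=5):
    """Chi-square statistic for a 2-D equidistribution test across two streams.

    The top `bits` bits of each uniform select a cell of the unit square.
    Returns the statistic and its degrees of freedom.
    """
    cells = 1 << bits
    counts = [0] * (cells * cells)
    n = min(len(stream_a), len(stream_b))
    for u, v in zip(stream_a, stream_b):
        counts[int(u * cells) * cells + int(v * cells)] += 1
    expected = n / (cells * cells)
    chi2 = sum((c - expected) ** 2 / expected for c in counts)
    return chi2, cells * cells - 1


# A straw man: two identical streams put every point on the diagonal.
a = [(i * 7 % 1000) / 1000 for i in range(10000)]
print(latitudinal_chi2(a, a))
```

For independent streams the statistic should be close to its degrees of freedom; duplicate streams give a value that is larger by orders of magnitude.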

Given the insensitivity of high-dimensional
latitudinal tests, one would ideally compute tests for all
possible subsets in low dimensions. With m streams,
a test of dimensionality j, and k lags under
consideration, this would involve $k\binom{m}{j}$ tests. This is infeasible
unless either m is small (a dozen or two) or j is very
small (2, 3, or 4). Since uniformity in the lowest
dimensions is always desired, it is recommended to use
an equidistribution test in dimension 2 (and 3 if
feasible) on all pairs (triplets) of streams. For formal
testing, Bonferroni tests should be used until
information on the joint distribution of the $\binom{m}{j}$ possible
p-values is available.

Another possibility is to compute a small random
sample of the $\binom{m}{j}$ possible tests. If $m \gg j$, then all
these tests should be effectively independent. Note,
however,  that  the  probabilistic  guarantees  afforded 
by  such  a  test  are  only  useful  if  streams  with  a  small 
fraction  of  dependencies  will  result  in  a  successful 
computation. 
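
To give a feel for the sizes involved, the sketch below counts the exhaustive tests, reading that count as k times the binomial coefficient "m choose j", and draws a random sample of stream subsets instead. The function names and the seed are assumptions made for the example.

```python
import math
import random


def n_exhaustive_tests(m, j, k):
    """Tests needed to cover every j-subset of m streams at k lags."""
    return k * math.comb(m, j)


def bonferroni_level(alpha, n_tests):
    """Per-test significance level that keeps the overall level at alpha."""
    return alpha / n_tests


def sample_subsets(m, j, n_samples, seed=0):
    """Random sample of j-subsets of the stream indices 0..m-1."""
    rng = random.Random(seed)
    return [tuple(sorted(rng.sample(range(m), j))) for _ in range(n_samples)]


print(n_exhaustive_tests(32, 2, 1))      # all pairs of 32 streams
print(n_exhaustive_tests(1000, 4, 12))   # far too many to run
print(sample_subsets(1000, 4, 5))
```

All pairs of 32 streams is only 496 tests, while four-dimensional tests on a thousand streams are clearly out of reach, which is where random sampling of subsets comes in.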

3  Experience 

We have done empirical testing in the course of
constructing three parallel random number generators:

•  A default vectorized PRNG, intended for simple
use with a moderate number of tasks (up to
several hundred) on a Cray X-MP computer;

•  A scalar PRNG for a physics simulation using
up to many millions of short (at most several
thousand) streams on a Cray X-MP computer;
and

•  A special-purpose PRNG for a physics
computation on an eight-processor Alliant computer.

For the first two generators, we chose to split the
sequence from the default Cray RNG RANF, a
multiplicative congruential generator with modulus $2^{48}$
and multiplier 44485709377909. This choice was
made for three reasons: compatibility with results
from older codes; availability of the spectral test
(which can be derived, for splitting use, from the
work of Percus and Kalos (in press)); and, for the
second generator, small state space (one word). For the
first generator, which requires initialization to select
a maximum possible number of streams, we
considered forcing an odd number of streams and evenly (or
nearly evenly) splitting the sequence, evenly splitting
the sequence and then backing off a fixed amount, and
evenly splitting the sequence and then backing off a
fixed fraction. Testing was intended to compare these
three schemes. For the second generator, where the
number of streams was only bounded by 70,000,000,
we decided to provide a default, even jump between
streams, and so wanted to use testing to help select
that jump. The third generator required that either
very long streams or very many streams be
available, which strained the congruential; we decided to
split the sequence from a lagged-Fibonacci. In the
absence of deterministic testing, we decided not to
risk even sequence spacing, and so generated starting
points with a congruential generator, in the hope of
distributing starting points at random through the
sequence of the lagged-Fibonacci. We used testing
to check for overall interstream quality and to ensure
that  the  starting  point  mechanism  was  not  too  bad. 
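
Splitting a multiplicative congruential sequence evenly can be illustrated with the modulus $2^{48}$ and the multiplier quoted above. The sketch assumes an odd seed and a period of $2^{46}$, which is the standard result for a multiplier congruent to 5 mod 8 with a power-of-two modulus. The function names and the seed value are hypothetical, and none of the "backing off" variants is shown.

```python
M = 2 ** 48
A = 44485709377909


def next_state(x):
    """One step of the multiplicative congruential generator."""
    return (A * x) % M


def jump(x, steps):
    """Advance the state x by `steps` draws using modular exponentiation."""
    return (pow(A, steps, M) * x) % M


def split_starts(seed, n_streams, period=2 ** 46):
    """Evenly spaced starting states for n_streams substreams."""
    gap = period // n_streams
    return [jump(seed, i * gap) for i in range(n_streams)]


starts = split_starts(seed=1, n_streams=16)
print([format(s, "012x") for s in starts[:4]])
```

Because `jump` costs only a logarithmic number of multiplications, starting points for millions of streams can be generated directly from the stream index.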

We  also  tested  various  straw  men.  One  was  an  even 
split of the aforementioned RANF into $2^c$ streams for
c from 1 up to 16. Another was a split which either
used very small lags (we tested 1, 5, and 40) or used
small lags to an even split as above. A final set of
straw  men  were  generators  which  generated  duplicate 
streams  (both  a  small  number  and  a  small  fraction) 
and  streams  which  were  mixtures  of  other  streams. 

The tests used were correlation tests, a
two-dimensional equidistribution test (five bits), a
four-dimensional equidistribution test (three bits),
permutation tests up to dimension six, a collisions test in
dimension 20 (which tested only the top bit), and the
Birthday Spacings test in dimension 256. The two-
and four-dimensional tests were used on all subsets
for up to 32 and 10 streams respectively, and all
latitudinal tests were used in random combinations for
$2^c$ streams with c up to 16. For the random
combinations, some tests were done with a randomly chosen
lag on each stream. Lags were chosen with a
geometric distribution, with the probability of zero lag equal
to 1/2.

The tests were very effective at discovering the
straw men, with the exception of the Birthday
Spacings test. The low-dimensional tests differed from the
null hypotheses most spectacularly. Of course, the
unlagged tests did not uncover the problem with the
small-lag streams; those were most reliably detected
by the correlations tests, which worked surprisingly
well, even at detecting the straw men involving even
splits. For even splits with c above 10, the strongest
interstream dependencies involve only a small
fraction of the streams; still, as long as several hundred
randomly chosen tests were performed, the
deficiencies were noticed.

The tests did not discover deficiencies in the
schemes under serious consideration; whether this
indicates lack of power in the tests or good parallel
random number generators is unclear. We lacked
exact joint distributions when testing all subsets, but
no values ever exceeded the Bonferroni limits. Tests
were iterated and analyzed as in Fishman and Moore
(1982), but still no clear pattern of failure emerged.
The testing did twice uncover what turned out to be
programming errors which generated bad or badly
dependent streams, so there may be some hope for
these specific tests.

4  Recommendations 

While  congruential  schemes  have  severely  limited 
discrepancies  (the  same  limits  first  described  by 
Marsaglia  (1968)  apply),  they  survive  tests  like  these. 
This  indicates  that  passing  such  tests  is  a  minimal 
requirement  for  parallel  random  number  generators. 
Better  tests  in  large  dimensions  remain  of  interest,  as 
the  power  of  existing  tests  in  hundreds  of  dimensions 
leaves  much  to  be  desired. 

For specific computations, three recommendations
can be made. The first is that tests should be tailored
to the application, as recommended by Marsaglia
(1985). Poor results from bad ordinary RNG's may
provide some guidance in finding specific tests to turn
into latitudinal tests. The second recommendation is
that specific streams with crucial independence
requirements should be identified and tested heavily.
For instance, if streams are used spatially, then each
stream should be tested against all nearby streams.
A final recommendation is that users of parallel
random number generators should have available an
ad hoc scheme for improving PRNG's, using shuffling or
combination (see Durst (1988)). Suspicious
simulation or Monte Carlo results can then be submitted to
the improved scheme for validation. Of course, the
improvement scheme should be submitted to testing
to ensure that it at least does not degrade the PRNG.

Interactive Smoothing Techniques
Wolfgang Härdle, Universität Bonn


Abstract 


For effective implementation of smoothing techniques
a conditio sine qua non is an interactive computing
environment. We describe some of the logical structures
that we find convenient for interactive smoothing.
These structures are implemented in XploRe - a computing
environment for parameter free regression and
density smoothing in high and low dimensions.

0.  The  Smoothing  Analysis  Cycle 


Smoothing means parameterfree estimation of
regression and density curves. If $X \in \mathbb{R}^d$, $Y \in \mathbb{R}$
denote a pair of random variables, it is the task of
regression smoothing to estimate the mean function
$m(\cdot) = E(Y \mid X = \cdot)$ from an independent sample
$\{(X_i, Y_i)\}_{i=1}^n$. Density smoothing consists of finding
good approximations to the density function $f(\cdot)$ of $X$
from an i.i.d. sample $\{X_i\}_{i=1}^n$. If no parametric
restrictions are imposed on these curves the smoothing
technique is nonparametric or parameterfree and is
typically based on "pooling
neighboring  information”,  see  Stone  (1977). 
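
One simple way of "pooling neighboring information" for regression is a kernel-weighted local average. The sketch below uses a Gaussian kernel in one dimension; the estimator, the function name and the bandwidth parameter h are chosen for illustration only and are not taken from XploRe.

```python
import math


def kernel_smooth(x, y, grid, h):
    """Kernel-weighted local average of y at each point of grid.

    Observations with x_i close to the grid point (relative to the
    bandwidth h) receive the largest weights. If h is tiny compared with
    the spacing of x, the weights can underflow to zero; this is not handled.
    """
    fit = []
    for g in grid:
        w = [math.exp(-0.5 * ((g - xi) / h) ** 2) for xi in x]
        fit.append(sum(wi * yi for wi, yi in zip(w, y)) / sum(w))
    return fit


x = [i / 20 for i in range(21)]
y = [math.sin(2 * math.pi * xi) for xi in x]
print(kernel_smooth(x, y, grid=[0.25, 0.5, 0.75], h=0.1))
```

Each fitted value is a weighted mean of the responses, so the fit can never leave the range of the observed y values; the bandwidth h controls how wide the pooling neighborhood is.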

There exists a wide variety of methods for
parameterfree estimation, see e.g. Silverman (1986). These
methods have more or less the same asymptotic sharpness
but behave quite differently for finite sample size. This
is a situation where the computer can be a very good
assistant: smoothing means function estimation and
therefore different results can only be studied in the form of
comparing graphs or tables of values. Another scenario
in this setting is to form residuals and to examine them in
an iterative way for non-fitted or overfitted structure, see
e.g. the backfitting procedure of Hastie and Tibshirani
(1987). Here again the computer is a great assistant in
trying several alternatives.

Smoothing in dimensions of X bigger than two
creates difficulties on the computational and on the
statistical side. First of all one cannot study the full fit function
without additional "artificial dimensions". Scott (1986)
proposes to use time as this dimension and presents
changing density contours for dimension d = 4.
Secondly, in data sets with moderate sample size there is
not enough data to perform the "local data pooling" in
an effective way. (Theoretically speaking, this means
that the rate of convergence of nonparametric
smoothers is extremely slow for large dimensions d, see
Ibragimov and Hasminski (1982) and Stone (1982).)
Additive models reduce this dimensionality problem but
require quite a bit of machine power, e.g. the Projection
Pursuit Regression (PPR) algorithm by Friedman and
Stuetzle (1981). Interactive control of such an additive
model comes into consideration, where one would like to
see slightly different projections and corresponding
alternative smooth fits in a small neighborhood of some
currently favored fit.

Even if a single smoothing method is preferred, the
choice of smoothing parameter is rather delicate. A
wide variety of algorithms yield (asymptotically)
“optimal curves”, but these can be quite different for finite
sample size, see Marron (1986).

Summarizing the above situations, we can state that
the applied scientist will experiment with different smooth
fits and try several alternatives in an iterative way. The
typical scenario might be described as follows. The scientist
starts with some initial smooth curve and then examines
the graph and perhaps residuals. In a further
step he evaluates this information, perhaps using prior
information on forms or structure of the current curve;
then he may want to compare this current curve with
an alternative. This iteration procedure can be called a
smoothing analysis cycle, as depicted in Figure 0.1.


Figure 0.1. The smoothing analysis cycle (steps: Smooth, Evaluate)
This cycle might be performed several times in an
improvised way before one or several satisfactory results
are obtained (McDonald and Pederson, 1986). It
is obvious that one needs a highly interactive computing
environment to go effectively around this cycle.

1.  XploRe  -  an  interactive  smoothing  environment 

The  computing  environment  necessary  to  perform 
such  experimental  smoothing  falls  into  three  layers 
(Chambers,  1986): 

α) the individual computer;

β) the operating system;

γ) the special logical structures for smoothing.

All three parts interact with each other. Since the hardware
α) and the system software β) that goes along with it have
become affordable even for small institutions, the discussion
of how to optimize α) and β) does
not seem too relevant to us. In fact we will present the
system XploRe as it was developed on a “relatively simple”
machine, an IBM AT. The data and program structures
γ) for data smoothing and handling seem to be
more important for achieving a high degree of interactiveness.
They should fulfill the following basic requirements.

(1.1) The interactive system should allow convenient
comparison of different fits, preferably in a
graphical way.

(1.2) Certain viewpoints or snapshots (from different
“angles”) of the data and its smooth should be
recordable.

(1.3) Results, summary statistics or verbalized impressions
should be storable on the spot and visible
at convenience.

(1.4) Intermediate stages of a smoothing analysis should
be deletable or evocable. Input/Output to or via
other layers of the computing environment must
be possible.

(1.5)  A  dump  and  a  reloading  of  the  current  stage  of 
analysis  should  be  possible. 

In  order  to  fulfill  the  above  requirements  we  defined  in 
XploRe  the  following  basic  objects: 


vector, 

workunit,

picture, 

text. 

Vectors are the simplest objects; they contain an
alphanumeric data array of variable length. Workunits are
collections of pointers to vectors and may include display
and mask attributes. Picture objects are viewports,
defining the location and tick marks of the axes in 2D or
3D views. Text objects are sequences of text lines with
variable length.

In order to fulfill (1.4) and (1.5) we defined the
following basic operations on these objects. Objects can
be

created/deleted; 

activated/deactivated; 

read/written;

manipulated; 

displayed. 

The concept of the workunit object meets requirement
(1.1). In its simplest form a workunit object can be
thought of as a data matrix, but its actual realization
as a record of pointers to existing vector objects
makes it economical in storage space. The additional feature
of this object to include mask and display information
makes exploratory techniques like brushing (Becker and
Cleveland, 1986) easy to program. The display information
as part of a workunit object makes it convenient to
distinguish different functions: whenever the workunit
object is displayed (in a picture object) the corresponding
display style information (part of this workunit)
is used. This makes it easy to remember different curves.
The mask part of this data object can be inherited
by children objects (e.g. smooths) of a workunit and
thus makes it possible to trace interesting points through
several steps of an analysis; see Oldford and Peters
(1986) for more information on this inheritance principle
and this object oriented approach. A graphical
description of workunits is depicted below in Figure 1.1.
Figure 1.1. Two workunits with mask and
display information

Figure 1.1 shows the situation where one wants to
analyse a three dimensional data set consisting of vectors
X, Y, Z. Workunit wu-one consists of the vectors
X, Y; another, wu-two, points to all three vectors. When
displaying wu-one one could have detected some interesting
points, which one has interactively marked with
the mask “7”. Other observations might have been given
the mask “invisible”. Earlier one might have decided
to see the remaining points as stars “*” (except those
that have mask “7”). Wu-two is shown with squares
and needles “|” pointing into the (X,Z) plane with no
additional mask options.
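
The pointer-based layout of Figure 1.1 can be illustrated with a minimal Python sketch. It is not XploRe code (XploRe itself is written in Turbo Pascal); the class names Vector and Workunit, the methods rows and child, and the data values are assumptions made only for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Vector:
    """A named data array (the simplest object type)."""
    name: str
    data: list


@dataclass
class Workunit:
    """A record of references to shared Vector objects plus display and mask information."""
    name: str
    vectors: list                               # references, not copies
    display: str = "*"                          # default plotting symbol
    mask: dict = field(default_factory=dict)    # observation index -> mask label

    def rows(self):
        """Return the visible observations as (index, values, symbol)."""
        out = []
        for i in range(len(self.vectors[0].data)):
            label = self.mask.get(i)
            if label == "invisible":
                continue
            symbol = label if label is not None else self.display
            out.append((i, tuple(v.data[i] for v in self.vectors), symbol))
        return out

    def child(self, name, vectors, display):
        """Create a derived workunit (e.g. a smooth) that inherits a copy of the mask."""
        return Workunit(name, vectors, display, dict(self.mask))


# Hypothetical data: three vectors shared by two workunits.
x = Vector("X", [1.0, 2.0, 3.0, 4.0])
y = Vector("Y", [2.1, 3.9, 6.2, 8.1])
z = Vector("Z", [0.5, 0.4, 0.3, 0.2])

wu_one = Workunit("wu-one", [x, y], display="*")
wu_two = Workunit("wu-two", [x, y, z], display="|")

wu_one.mask[1] = "7"            # an interesting point
wu_one.mask[3] = "invisible"    # a hidden observation

smooth = wu_one.child("smooth", [x, Vector("S", [2.0, 4.0, 6.0, 8.0])], display="-")
```

Both workunits hold references to the same X and Y vectors, so no data are duplicated, and the child workunit carries the mask of its parent so that marked points can be followed through later steps.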

Picture objects are designed to meet requirement
(1.2) and contain information about the location of the
2D or 3D viewport on the screen, the scaling of all the
axes and the location of the axes on the physical screen.
This object type is resident until its parts are changed.
If one displays a workunit object and has found a reasonable
scaling, this current picture object can be evoked at
later stages. A picture object can be graphically
represented as in Figure 1.2.


Figure  1.2.  A  picture  object 

Different workunits may be displayed in different
picture objects. Figure 1.3 below shows a workunit (pointing
to the raw data) as a pointcloud together with
another workunit showing the smooth regression curve,
both in one picture object. A density estimate of the
marginal density of X is displayed in another picture
object (viewport “picture 2”) at the upper right corner of
the screen.


Figure  1.3.  Two  different  picture  objects 


Text objects are defined according to (1.3). They
contain ASCII text lines of variable column length. If
such an object is displayed, scrolling forward and backward
in the actual text is possible. If a text object
contains columns of data vectors (as ASCII information) it
can be converted into a workunit object (with standard
display and mask part) and vice versa.
2. Smoothing Techniques

The basic operations on the four objects have been
defined above. All these operations are more or less
self-explanatory, so we concentrate in this section on the
manipulation of workunit and picture objects. The different
smoothing techniques entered via this manipulation
of an active workunit are described below. The following
lists are by no means exhaustive. XploRe (1987) is an
open system; more software can be included, see section
3.

2.1  Regression  Smoothing 

-  Regressogram  (Tukey,  1961). 

- k-nearest neighbor estimation (Mack, 1981).

-  Supersmoothing  (Friedman,  1984). 

-  Kernel  smoothing  (Nadaraya,  1964;  Watson,  1964). 

-  WARPing  (Hardle  and  Scott,  1988). 

-  Isotonic  Regression  (Barlow  et  al.,  1972). 

-  Running  Median  (Tukey,  1977). 

-  Polynomial  Regression  (Shibata,  1981). 

-  Cross-validation  (Clark,  1980). 

2.2  Density  Smoothing 

-  Histogram. 

- k-nearest neighbor estimation (Cover and Hart, 1967).

-  Kernel  smoothing  (Rosenblatt,  1956). 

-  (Log)Normal  fitting. 

- L2 and Kullback-Leibler cross-validation (Marron, 1987).

2.3  Additive  Model 

-  Alternating  Conditional  Expectations  (ACE)  (Breiman 
and  Friedman,  1985). 

-  Projection  Pursuit  Regression  (PPR)  (Friedman  and 
Stuetzle,  1981). 

- Recursive Partitioning Regression Trees (RPR) (Breiman,
Friedman, Olshen and Stone, 1984).

- Average Derivative Estimation (ADE) (Hardle and Stoker,
1988).

2.4  The  interactive  display 

The interactive display features of XploRe allow
manipulation of both workunit and picture objects. Removal,
identification and classification of points is performed
by pointing with a cursor to a group of points.
This technique is incorporated in XploRe by the label
and mask options of the graphics command menu, see
Figure 2.1. The mask information will be inherited by
the currently displayed workunit object. By clicking the
“label” field the cursor can be moved to any point on the
screen. After pressing ENTER a window pops up that
shows the index of the observation (closest in Euclidean
distance) together with the coordinates of the workunit.
This feature enables the user to see all coordinates of a
high dimensional workunit although he might be looking
only at one “interesting” point in a two or three dimensional
projection. The “mask” field allows the user to
interactively define a rectangle of points which he would
like to classify into groups 1-9 or invisible. The “unmask”
option reverses this action. The edit field allows the user
to change the tick marks and the scaling of the axes and
also the display style of the workunit currently shown.
The movoff field is a switch to movon, which means that all
screen information is stored in a movie fashion to disk.
By pressing movie the saved screens will be shown; this
feature allows tracking of past actions.


Figure  2.1.  The  interactive  display 
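
The two point-selection operations described above (“label” looks up the observation closest to the cursor in Euclidean distance; “mask” classifies all points inside a rectangle) can be sketched as follows. This is only an illustration in Python, not the XploRe implementation; the function names, argument order and sample coordinates are assumptions.

```python
import math


def nearest_observation(points, cursor):
    """Return (index, point) of the projected observation closest to the cursor."""
    best = min(range(len(points)), key=lambda i: math.dist(points[i], cursor))
    return best, points[best]


def mask_rectangle(points, xmin, xmax, ymin, ymax, label, mask=None):
    """Return a new mask in which every point inside the rectangle carries `label`."""
    mask = {} if mask is None else dict(mask)
    for i, (px, py) in enumerate(points):
        if xmin <= px <= xmax and ymin <= py <= ymax:
            mask[i] = label
    return mask


# Hypothetical 2D projection of a higher dimensional workunit.
projected = [(0.0, 0.0), (1.0, 1.0), (2.0, 0.5)]
full_records = [(0.0, 0.0, 7.0), (1.0, 1.0, 3.0), (2.0, 0.5, 9.0)]

index, _ = nearest_observation(projected, (0.9, 1.2))
label_window = (index, full_records[index])   # index plus all coordinates
group_mask = mask_rectangle(projected, 0.5, 2.5, 0.0, 1.0, 3)
```

The lookup is done on the projected coordinates shown on the screen, while the window reports the full record, which is what lets the user inspect all coordinates of a point seen only in projection.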

The viewport option allows the user to map certain
sub-rectangles of the screen to the whole screen. The
defaxorg field is for interactive definition of the axis origin.
Clicking axon switches to axoff, which has the effect of
displaying the data without the axes. The six fields above
the axis control refer to rotations clockwise and
counterclockwise around each of the three axes in 3D space. The
two fields in the upper left corner define the distance of
the eyepoint relative to the pointcloud. Clicking “>”
successively gives the impression of coming closer to the
data, whereas “<” makes the distance bigger. The 3D
graphics have been programmed according to Newman
and Sproull (1981).

The edit field is for locally changing the display style
and for inheriting the current picture object tick marks
and axis labelling. Figure 2.2 shows the screen just after
clicking “edit” in the situation of Figure 2.1.
Figure 2.2. Editing the picture object.

The sensitive fields, shown by rectangles, show the
current tics. By overwriting in these fields one changes
the layout of the axes. The reset option gives a standard
view in the cube $[0, \max(x,y,z)]^3$.

2.5  Help  information 


Help files can be attached by the system programmer
through a stack of “help windows”. The designer
of the computing environment determines at which
analysis stage which “help windows” should appear. The
help information is obtained by pressing F1. Subsequent
pressing of the help key guides through the stack of
currently attached help windows. The help windows are in
fact internally handled as temporary text objects which
are displayed as in Figure 2.3.


Figure  2.3.  A  help  window 

The help windows (and also text objects) can be
scrolled backwards and forwards by using the PageDown
and PageUp keys. All pulldown menus can be folded and
unfolded by successive pressing of the F10 key.


3.  Installing  own  procedures 

The system XploRe can be enhanced by installing
user written procedures. As an example of how to install
own routines we describe how the running median primitive
was implemented in XploRe. Assume that there
is already a procedure runmed(y,n,k,s) with input
array y, length n, smoothing parameter k and output
array s (containing the running median sequence). An
optimal algorithm has been given by Hardle, Reinholz
and Steiger (1988). The user chooses this manipulation
by mouseclicks, and by definition the manipulation refers
to the active workunit object. This workunit will then
be temporarily sorted by the first column (interpreted
as the predictor variable x); then the response variable
y has to be stripped off to determine the running median
smooth s. It is convenient to build a vector object for this
output array s and to create a workunit containing links
(pointers) to the vector object containing the predictor
variable x. In XploRe (respectively TURBO PASCAL)
these operations would read as follows.

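procedure dorunmed (wu: objectid);
var
  x, y, s: workarray;
  n, k: integer;
  xvec, yvec, svec, newwu: objectid;
begin
  quicksort(wu);                 { sort the workunit by its first column }
  getvector(wu, xvec, x, n, 1);  { extract the predictor x }
  getvector(wu, yvec, y, n, 2);  { extract the response y }
  getparameter(k);               { reads the window size k from the keyboard }
  runmed(y, n, k, s);            { running median of y, returned in s }

  createobj(svec, vectorparttyp, 'smooth');
  updatevector(svec, s, n);      { store s in the new vector object }
  createobj(newwu, wuparttyp, 'runmed');
  inclink(newwu, xvec);          { link the predictor into the new workunit }
  inclink(newwu, svec);          { link the smooth into the new workunit }
end;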


The getvector procedure extracts from workunit wu
the x and y arrays. The createobj procedure creates an
object of the specified type (vectorparttyp, wuparttyp).
The updatevector (inclink) procedure includes an array
(a link) into vector objects (workunit objects).
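
For readers who want to try the same sequence of steps outside XploRe (sort by x, strip off y, smooth, pair the smooth with x), a small Python sketch follows. It is a naive implementation, not the optimal algorithm cited above, and the treatment of the boundaries (the window is simply truncated at both ends) is an assumption, since the text does not specify it. The function names mirror the Pascal listing but the signatures are simplified.

```python
from statistics import median


def runmed(y, k):
    """Running median of y with odd window size k; the window is truncated at the ends."""
    if k < 1 or k % 2 == 0:
        raise ValueError("k must be a positive odd integer")
    half = k // 2
    n = len(y)
    return [median(y[max(0, i - half):min(n, i + half + 1)]) for i in range(n)]


def dorunmed(x, y, k):
    """Sort the observations by x, then return (sorted x, running median of y)."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    xs = [x[i] for i in order]
    ys = [y[i] for i in order]
    return xs, runmed(ys, k)


# A single outlier is removed by a window of size 3.
spike_removed = runmed([1, 1, 100, 1, 1], 3)
xs, smooth = dorunmed([3, 1, 2], [30, 10, 20], 3)
```

The pair (xs, smooth) plays the role of the new workunit that links the predictor vector and the smooth vector.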
INTERACTIVE MULTIVARIATE DENSITY ESTIMATION IN THE S LANGUAGE


David W. Scott and Mark R. Hall, Rice University


Abstract 

We have been developing experimental software on workstations
to produce high quality color graphical representations of
multivariate density estimates via averaged shifted histograms. Some
of our programs have been stand alone applications and some have
been written in larger systems such as the S language. Part of our
experiment was implementing our algorithms in Becker and Chambers’
S interface language. We discuss our experiences and try to
illustrate the results.

1.  Introduction 

We have been developing algorithms for data analysis with
emphasis on graphical display, often innovative but non-standard
in format. An example of this work has been the estimation and
representation of nonparametric probability density estimates of
data in $1 \le d \le 4$ (Scott and Thompson, 1983; Scott, 1985,
1986). Other examples include nonparametric regression, additive
models, and computationally intensive algorithms such as
cross-validation (Scott and Terrell, 1987; Scott, 1988; Hardle and Scott,
1988). All of the algorithms have been developed in Fortran (F77),
with custom programming of an AED512 terminal for graphical
output. However, in the past few years, we have increased our use of
the S language (Becker and Chambers, 1984) for data manipulation
and standard graphics.

The question we asked was: “Would it be both feasible and
effective to use the S language for development of experimental
algorithms on a UNIX workstation?” Some of these algorithms
would have standard graphical output, such as x-y plots of
cross-validation functions, while other algorithms would attempt to
display three-dimensional density contours. Becker and Chambers
(1985) have provided a mechanism by which any working F77
subroutine may be installed into the kernel, effectively becoming a
“new” S function. This task is accomplished by writing an
interface routine using a C-like language and calling S-supplied graphics
calls inside the F77 routine. In fact, the built-in S functions
themselves are written in the interface language with F77 subroutines
for complex functions.

For anyone familiar with the S language (or other similar
languages such as GAUSS on the PC), it is obviously desirable to have
one’s “established” routines available as S functions (Chambers,
1980). The primary benefits are on-line documentation, simplicity
of input and output data handling, and device-independent
graphics available both inside S functions and for application to
output S data structures. However, the process of creating and
debugging interface routines can be more than a bit exciting even
with established routines; does it make sense for experimental and
evolving code?

After almost a year working with workstations, it seemed natural
to evaluate that experience and plan the most productive
strategy for our work. In addition to using S, two other approaches were
considered. First, the “old way” of writing large custom Fortran
routines on a mainframe or workstation with output to an AED
512 terminal with byte-level control or with output to an IRIS
terminal controlled by calls to high-level graphics libraries furnished
by Silicon Graphics. The second approach is similar to the first
but uses an integrated platform such as a Sun 3/60 workstation or
a Mac II and a language such as Fortran, Pascal, or C. The second
approach was the only alternative seriously considered.

There are several advantages of writing high level language
routines directly rather than using S. The codes tend to be smaller
and a bit faster. S functions take much longer to compile during
the debugging phase. Moreover, there is less code to debug. In
particular, S programming creates intermediate Fortran files that
generate errors that must be traced back to the original interface
and Fortran routines. This traceback problem is familiar to users
of the Unix Fortran preprocessor, Ratfor.


The advantages of using S are several. It is much easier to
prepare test data and input using the wide array of available S
functions and data structures. The calling sequences for S functions
are much shorter than the actual Fortran subroutines, since
only the input variables need be specified (output variables are
automatically returned in a data structure) and many input variables
can be given common default values. It is also very easy to create
quick and dirty graphs for experimentation and then modify and
improve the graphs. But the most important reason is that coding
experimental routines in S minimizes software loss. Have you ever
tried to run an “experimental” code after a six month layoff and
get anything useful from it? In the summer of 1987, I developed a
code in S for computing average derivative estimates (Hardle and
Scott, 1988) while visiting Wolfgang Hardle in Bonn. A year later,
while Hardle was an invited lecturer at Rice, we were able to
immediately use those routines on new data he brought “cold.” I
have seldom had that experience in any other language. So a
significant part of the advantage is that output can be well-organized
by design to reside in S data structures that can be interactively
listed, graphed, or analyzed. Too often a directory with an
experimental Fortran algorithm contains a bunch of files named fort.1,
fort.2, etc. Another S tool that helps minimize lost work is the
diary file, which contains a record of all S sessions. Thus the
advantages of using S relative to custom routines are significant for
an experienced developer.

Of course, S is not totally unique among languages with
respect to these capabilities. In fact, John McDonald and others
prefer a totally unified environment such as that offered on a
Symbolics workstation (McDonald and Pedersen, 1985). It is clear that
such a LISP platform is powerful, but S provides more immediate
productivity gains since it has a feel more similar to classical
languages.

2.  Data  Analysis  via  Density  Estimation 

The symmetric positive kernel estimate studied by Rosenblatt
(1956) and Parzen (1962) has been widely used to study data

$x_1, x_2, \ldots, x_n$:


This formula is easily extended to multivariate data by using a
multivariate probability density function as the kernel. A much more
convenient and computationally inexpensive form is the averaged
shifted histogram (Scott, 1985):


where $H(h;t)(\cdot)$ is an equally spaced histogram with bin width $h$
and mesh location uniquely determined by having one mesh node
at $t$. The ASH amounts to a weighted average of rounded points
(WARP) and is also easily extended into several dimensions. This
idea can also be applied to a wide array of nonparametric and
additive models (Hardle and Scott, 1988).
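
To make the “weighted average of rounded points” description concrete, the following Python sketch computes a univariate ASH. It assumes the usual textbook formulation: the data are rounded into fine bins of width h/m, and each fine-bin count receives the triangular weight 1 - |i|/m before division by nh. The function name ash1d, its arguments and the sample data are assumptions made for this illustration and are not taken from the authors’ S or Fortran code.

```python
def ash1d(data, h, m, origin, nbins):
    """Averaged shifted histogram on `nbins` fine bins of width h/m starting at `origin`.

    Equivalent to averaging m ordinary histograms of bin width h whose
    meshes are shifted by multiples of h/m.
    """
    delta = h / m
    counts = [0] * nbins
    for value in data:                       # round each point into a fine bin
        k = int((value - origin) // delta)
        if 0 <= k < nbins:
            counts[k] += 1
    n = len(data)
    density = []
    for k in range(nbins):                   # triangular weighting of neighbouring counts
        total = 0.0
        for i in range(1 - m, m):
            if 0 <= k + i < nbins:
                total += (1 - abs(i) / m) * counts[k + i]
        density.append(total / (n * h))
    centers = [origin + (k + 0.5) * delta for k in range(nbins)]
    return centers, density


sample = [0.1, 0.2, 0.25, 0.7]
centers, density = ash1d(sample, h=0.5, m=4, origin=-1.0, nbins=24)
_, plain_histogram = ash1d(sample, h=0.5, m=1, origin=-1.0, nbins=6)
```

With m = 1 the estimate reduces to the ordinary histogram, and as long as the grid extends at least m - 1 fine bins beyond the data the estimate integrates to one.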

2.1  Representation  of  Density  Estimates 

The most effective way to represent multivariate density
estimates has been a source of many interesting discussions and much
research. In the case of bivariate data, we have heard talks in which
perspective views of a three dimensional bivariate density surface
have been severely criticized relative to contour plots. Such a
position seems far too extreme. However, for our purposes, contour
plots are preferable because they extend naturally into higher
dimensions. The display methodology we have advocated (Scott and
Thompson, 1983; Scott, 1983; and Scott, 1986) has been to draw
α-level contours where 0 < α < 1. Specifically, these contours,
which we shall refer to as “α-shells”, are defined by the sets

$$S_\alpha = \{x \in \mathbb{R}^d : f(x) = \alpha f(m)\}$$

where $m$ is the mode of $f$:

$$m = \arg\max_x f(x).$$

For trivariate data, there is one degree of freedom, namely the
density level α, although the viewing angle might reasonably be
considered another degree of freedom. For quadravariate data x =
(x, y, z, t), we display the trivariate shell satisfying

$$S_\alpha(t_0) = \{(x,y,z) \in \mathbb{R}^3 : f(x,y,z,t \mid t = t_0) = \alpha f(m)\}.$$

Clearly there are two degrees of freedom, the contour level α and
the slice intercept $t_0$. For those interested in animation of
algorithms, both parameters provide for interesting data views. We
have in fact created movies of four-dimensional LANDSAT data
using this technique (Scott and Jee, 1984).
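
On a binned estimate the α-shell can be located by thresholding at α times the largest density value. The Python sketch below does this on a two-dimensional grid for brevity (the same rule applies to each three-dimensional slice); it returns the grid cells that lie on the boundary of the region where the density is at least α f(m). The function name and the toy grid are assumptions made for illustration, and the cell-based boundary is a simplification of the interpolated contours used in the figures.

```python
def alpha_shell_cells(density, alpha):
    """Cells on the boundary of {f >= alpha * max f} for a 2-D grid of density values."""
    if not 0 < alpha <= 1:
        raise ValueError("alpha must lie in (0, 1]")
    rows, cols = len(density), len(density[0])
    level = alpha * max(max(row) for row in density)   # alpha times the modal height

    def inside(r, c):
        return 0 <= r < rows and 0 <= c < cols and density[r][c] >= level

    shell = []
    for r in range(rows):
        for c in range(cols):
            neighbours = ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1))
            if inside(r, c) and not all(inside(nr, nc) for nr, nc in neighbours):
                shell.append((r, c))
    return shell


# Hypothetical binned density with a single mode in the centre.
grid = [
    [0, 0, 0, 0, 0],
    [0, 1, 2, 1, 0],
    [0, 2, 10, 2, 0],
    [0, 1, 2, 1, 0],
    [0, 0, 0, 0, 0],
]
low_shell = alpha_shell_cells(grid, 0.15)
high_shell = alpha_shell_cells(grid, 0.5)
```

Lowering α widens the shell around the mode, which is the degree of freedom described above.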

2.2  Example  with  Particle  Physics  Data  Set 

We begin by examining a four-dimensional particle physics
dataset provided with the S language (Friedman and Tukey, 1974).
These data fall in a relatively narrow strip on the fourth variable,
as can be seen by examining all pairwise scatter diagrams of these
data in Figure 1. This plot was constructed by the pairs S
command. This conclusion is strengthened by an examination of the
four marginal histograms in Figure 2. Since there are 500 points in
each of the scatter diagrams, there is a great deal of overlap in the
plot. We can examine the “V” structure in the variable 2-3 plot by
computing a bivariate averaged shifted histogram, which is shown
in Figure 3 (Scott, 1987). The density plot indicates this is not a
symmetric “V” and fewer points fall along the two rays than the
scatter diagram may suggest. If we slice variable 4 ($t$) into ten
bins and then focus on the last bin where $t = t_{10}$ (which contains
most of the data), we can examine the density shell $S_{20\%}(t_{10})$, that
is, the 20% contour shell of a slice of the quadravariate density
estimate in the last bin along the variable 4 axis. The other axes
were binned into thirty bins. The continuous shell surface in three
dimensions is represented by a collection of two dimensional
contour slices perpendicular to the x and y axes. In Figure 5, we have
superimposed a higher level density contour. From these figures,
we conclude the data are rather uniformly distributed along this
truncated “U” shaped core.

2.3  Example  with  Transformed  Particle  Physics  Data  Set 

Kernel methods are sensitive to data that fall or nearly fall
into subspaces. Thus it may be appropriate to transform the
original data in such a manner that the resulting marginal
distributions are much less skewed. We have done this with the
particle physics data using the respective marginal transformations:
$\log(x_1)$, $\sqrt{\log(1 + x_2)}$, $-\sqrt{\log(1 - x_3)}$, $-\sqrt{\log(1 - x_4)}$. The
effect of this transformation is shown in Figures 6, 7, and 8. The
structure in these data is fairly clear. It is interesting to consider
the efficacy of the transformation in this case. Without specific
reference to the original purpose of these data, it is difficult to say
more.

3.  Marching  Cubes  Display  of  Density  Shells 

Oftentimes it is desirable to have a high resolution view of a
particular density shell surface. The previous representation is more
useful for providing a rather broad brush view of the entire density
by examining several density shells simultaneously. However, using
techniques used in CAD-CAM applications, the ASH shells may be
rendered as shaded solids. A particularly nice formulation of this
idea may be found in Lorensen and Cline (1987). They call their
method “marching cubes.” A three dimensional triangularization
is computed and then displayed using a false-color algorithm based
upon the direction of a unit normal. In the authors’ application,
a further color smoothing was desired and was accomplished by
a Gouraud shading technique. Such a technique on the averaged
shifted histogram might also be desirable, but we have chosen not
to do so to emphasize the piecewise linear nature of the estimator.

In  Figure  9,  we  show  a  screendump  of  the  triangularization  of 
an  exact  trivariate  normal  density  with  covariance  matrix 


This is a 30 by 30 by 30 mesh and the display is at the 5%-level.
Notice the visual discontinuity is really rather small even with such
a relatively coarse binning. On advanced color hardware, these
surfaces can be rotated in near real time. However, the number of
triangles is so large that a 10-20 MIPS workstation is necessary for
real time rotation.

The data discussed in sections 2.2 and 2.3 were also examined
using this tool. The ASH estimates computed in the S function
were written to an ASCII file and then input to a C suntools
program on a color Sun 3/260 workstation. While this was somewhat
cumbersome, it was very efficient from an experimental point of
view. In Figure 10, we show the triangularization of the 10%-level
of the ASH of the untransformed particle physics data; in Figure
11, we show the 5%-level of the ASH of the transformed particle
physics data. Such plots provide a great amount of detail at one
contour level, but not at several levels as before. However,
several advanced graphics workstations provide for transparent views
of surfaces. Displaying and rotating several ASH contour levels
simultaneously is the authors’ dream.
Figure 2. Histograms of particle physics data variables. Variables
1 and 2 are in the first row and 3 and 4 in the second.
Figure 10. Triangularization of an α = 10%-shell slice of the raw
particle physics data.

Figure 11. Triangularization of an α = 5%-shell slice of the
transformed particle physics data.


Figure 9. Triangularization of an α = 5%-shell of a trivariate
normal density with correlations = 0.8.
SMOOTHING DATA WITH CORRELATED ERRORS

ABSTRACT

Kernel smoothing is a common method of estimating
the mean function in the nonparametric regression model

$$y = f(x) + e$$

where $f(x)$ is a smooth deterministic mean function, and $e$
is an error process with mean zero. In this paper, the mean
square error of kernel estimators is computed for processes
with correlated errors, and the estimators are shown to be
consistent under restrictive assumptions on the sequence of
error processes. The standard techniques for bandwidth
selection, such as cross-validation and generalized
cross-validation, are shown to perform very badly when the
errors are correlated. Standard selection techniques are
shown to favor undersmoothing when the correlations are
predominantly positive and oversmoothing when negative.
However, the selection criteria can be adjusted to correct for
the effect of correlation.

Method of moments estimates of the correlation function
based on residuals are shown to be consistent when the
bandwidth is chosen in such a way that the smooth is consistent.
However, in finite samples oversmoothing leads to estimates
of correlation which are too large, while undersmoothing
leads to estimates which are too small.

Keywords: mean square error; kernel smoothing; correlated
errors; bandwidth; cross-validation; generalized
cross-validation.
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1.  Introcliicfion 

Kernel smoothing is a common method of estimating
the mean function in the nonparametric regression model

$$y = f(x) + e \qquad (1)$$

where $f(x)$ is a smooth deterministic mean function, and $e$
is an error process with mean zero. The focus of this work
is on estimating the unknown mean function when the design
points are equally spaced on [0, 1], and the errors come
from a stationary correlated process. The kernel estimators
of Priestley and Chao (1972) are used. These have the form

$$\hat f_{\lambda,n}(x) = \sum_{j=0}^{n} w_\lambda(x, j)\, y_{n,j}$$

where the weights are

$$w_\lambda(x, j) = \frac{1}{n\lambda} K\!\left(\frac{x - x_{n,j}}{\lambda}\right).$$

$K$ is called the kernel function, $\lambda$ is a smoothing parameter,
called the bandwidth, and $n$ is the sample size.
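The estimator above can be sketched in a few lines of NumPy. This is only an illustration: the quadratic kernel on (-1/2, 1/2), the function names, and the choice of n and bandwidth are assumptions made for the example, not taken from the paper.

```python
import numpy as np

def quad_kernel(u):
    # Assumed kernel for illustration: symmetric about 0, supported on
    # (-1/2, 1/2), and integrating to 1.
    u = np.asarray(u, dtype=float)
    return np.where(np.abs(u) < 0.5, 1.5 * (1.0 - 4.0 * u**2), 0.0)

def pc_weights(x, design, lam, kernel=quad_kernel):
    # w_lambda(x, j) = K((x - x_j) / lambda) / (n * lambda)
    n = len(design) - 1          # design points are j/n for j = 0..n
    return kernel((x - design) / lam) / (n * lam)

def pc_estimate(x, design, y, lam, kernel=quad_kernel):
    # Kernel estimate at x: a weighted sum of the observations.
    return float(pc_weights(x, design, lam, kernel) @ y)

n = 200
design = np.arange(n + 1) / n
y_const = np.ones(n + 1)         # a constant "mean function" as a sanity check
fhat_mid = pc_estimate(0.5, design, y_const, lam=0.2)
weight_sum = float(pc_weights(0.5, design, 0.2).sum())
```

At an interior point the weights form a Riemann sum for the integral of the kernel, so they sum to roughly one and a constant function is reproduced almost exactly. Near the ends of [0, 1] part of the kernel window falls outside the design, which is why the sketch is evaluated in the interior.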

Only kernels with the following properties are considered:

A) $K$ is symmetric about 0,

B) $K$ has support only on the interval $(-\frac{1}{2}, \frac{1}{2})$,

C) $K$ is Lipschitz continuous of order $\alpha > 0$.

$K$ is called a kernel of order $p$ if all the first $p - 1$ moments
of $K$ are 0, and the $p$th moment,

$$M_p = \int x^p K(x)\,dx,$$

is non-zero. The squared norm of $K$ is

$$W_K = \int K^2(x)\,dx.$$
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model  (1),  with  observations  taken  at  design  points  = 
The errors are assumed to come from a stationary process
with covariance function

$$\mathrm{Cov}(e_{n,i}, e_{n,j}) = \sigma^2 \rho_n(|i - j|),$$

where the variance, $\sigma^2$, is independent of $n$ and $\rho_n(k)$ is a
correlation  function  depending  on  n.  The  variance  matrix  of 
the  errors  will  be  denoted  by  If,. 

The purpose of this paper is to explore the properties of
kernel smoothers, and the use of model selection techniques,
such as cross-validation (CV) (Allen, 1974, Stone, 1974 and
Geisser, 1975), and generalized cross-validation (GCV)
(Craven and Wahba, 1979) when the errors are not independent,
but instead come from a stationary correlated process.

Figures 1a and 1b show realizations of the process $y =
\cos(3.15\pi x) + e$ when the errors come from, respectively, a
Gaussian process with unit variance, and an AR(1) process
with the same variance and $\rho = .5$. The Gaussian process
used in Figure 1a was used to generate the shocks for the
AR(1) process in Figure 1b, so the resulting sample paths are
very similar. Figures 2a and 2b show kernel estimates of the
mean function for this data when the bandwidth was chosen
using CV. For the realization with independent errors, the
estimate is quite smooth and captures the main features of
the mean function. For the realization with correlated errors,
the estimate is far too wiggly. Figures 3a and 3b show the
estimates of the mean function for this data using the optimal
(minimum totalled squared error) value of the bandwidth.
The estimates for the independent and AR(1) realizations
are now quite similar, and capture the main features of the
true mean function.
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iiieiits,  the  MSE,  defined  by 

$$\mathrm{MSE}(x,\lambda,n) = E\left(\hat f_{\lambda,n}(x) - f(x)\right)^2$$

is  often  used  as  a  goodness  of  fit  criterion  and  as  a  means  of 
assessing  the  asymptotic  properties  of  the  estimators.  The 
optimal  smoothing  parameter  is  often  considered  to  be  the 
one  which  minimizes  the  MSE  totalled,  or  equivalently,  aver¬ 
aged,  over  the  design  points  (TSE  and  ASE,  respectively). 

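The $\mathrm{MSE}(x,\lambda,n)$ is

$$\mathrm{MSE}(x,\lambda,n) = B^2(x,\lambda,n) + V_1(x,\lambda,n),$$

where the bias term is

$$B(x,\lambda,n) = w_\lambda^T(x,\cdot)\,\bigl(f(\cdot) - f(x)\bigr) \qquad (3)$$

and the variance term is

$$V_1(x,\lambda,n) = w_\lambda^T(x,\cdot)\,\mathrm{Var}(e)\,w_\lambda(x,\cdot).$$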

Notice that the bias depends on the sample size only
via the selection of design points and is not affected by the
correlation structure. For mean functions with at least $p$
derivatives, and kernels of order $p$, Gasser and Müller (1979)
computed the asymptotic form of the squared bias (when the
design points become dense on the interval) to be

$$B^2(x,\lambda,n) = \left(\lambda^p M_p f^{(p)}(x)/p!\right)^2 + o(\lambda^{2p}) + O(1/n)$$

when $\lambda \to 0$ and $n\lambda \to \infty$ and $\lambda/2 < x < 1 - \lambda/2$. Kernel
estimators are asymptotically unbiased under these conditions.

Correlation of the errors affects the variance term
$w_\lambda^T(x,\cdot)\,\mathrm{Var}(e)\,w_\lambda(x,\cdot)$. When the errors are independent, the variance
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when  p(l)  <  0.  Since  the  bandwidth  controls  MSE  by  trad¬ 
ing off variance for bias, this suggests that, compared to the
independent  case,  larger  bandwidths  will  be  needed  when  the 
correlations  are  positive,  and  smaller  bandwidths  when  the 
correlations  are  negative.  This  is  also  suggested  by  Figure  5, 
which  shows  a  typical  realization  of  a  process  from  model  (1) 
when  the  errors  are  correlated.  When  the  correlation  is  pos¬ 
itive,  nearby  errors  tend  to  have  the  same  sign,  and  a  large 
bandwidth  is  needed  to  average  them  out.  When  the  correla¬ 
tion  is  negative,  the  errors  fluctuate  rapidly  in  sign,  and  only 
a  small  bandwidth  is  needed.  Since  larger  bandwidths  lead  to 
larger  bias,  this  also  implies  that,  at  the  optimal  bandwidth, 
the MSE will be larger when the correlations are positive, and smaller
when the correlations are negative.
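The bias and variance trade-off described here can be checked numerically. The sketch below computes the exact average MSE of a Priestley-Chao smoother over a grid of bandwidths for AR(1) errors and reports the minimizing bandwidth for negative, zero and positive correlation. The quadratic kernel, the AR(1) correlation $\rho^{|i-j|}$, the evaluation region [0.25, 0.75] and all function names are assumptions chosen for the illustration.

```python
import numpy as np

def quad_kernel(u):
    # Assumed kernel on (-1/2, 1/2), used only for illustration.
    return np.where(np.abs(u) < 0.5, 1.5 * (1.0 - 4.0 * u**2), 0.0)

def weight_matrix(eval_pts, design, lam):
    # Row i holds the weights w_lambda(eval_pts[i], j).
    n = len(design)
    return quad_kernel((eval_pts[:, None] - design[None, :]) / lam) / (n * lam)

def ar1_corr(n, rho):
    # Correlation matrix of a stationary AR(1) process: rho^|i - j|.
    lags = np.abs(np.subtract.outer(np.arange(n), np.arange(n)))
    return rho ** lags

def average_mse(lam, rho, n=128, sigma2=1.0):
    design = np.arange(n) / n
    f = np.cos(3.15 * np.pi * design)
    interior = (design >= 0.25) & (design <= 0.75)   # avoid boundary effects
    W = weight_matrix(design[interior], design, lam)
    bias = W @ f - f[interior]
    R = ar1_corr(n, rho)
    var = sigma2 * np.einsum("ij,jk,ik->i", W, R, W)  # w' R w for each point
    return float(np.mean(bias**2 + var))

grid = np.linspace(0.05, 0.5, 46)
best = {rho: float(grid[np.argmin([average_mse(lam, rho) for lam in grid])])
        for rho in (-0.5, 0.0, 0.5)}
```

Under these assumptions the minimizing bandwidth grows as the correlation moves from negative to positive, which is the ordering the text predicts.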

Explicit evaluation of $V_1(x, \lambda, n)$ makes these ideas
precise. The critical statistic is the sum of the correlations
(when  it  exists) 
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(a)  Normal{0,1)  errors 


One consequence of this theorem is that kernel estimators
can be consistent only if $S_{\rho_n}$ is bounded as $n \to \infty$.
This condition is clearly not satisfied if the errors have
been generated by a weakly continuous stochastic process,
$\rho_n(e_i, e_j) = \rho(x_i - x_j)$. This process has been discussed by Hart
and Wehrly (1986) and Parzen (1959 and 1961). An important
result from these papers is that, if only a single realization
of the process has been observed, there are no consistent
linear estimators of the mean function as the design points are
sampled more and more densely on the unit interval. Parzen's
results show that the only unbiased linear estimator of $f(x_j)$
is $y_j$ (with variance $\sigma^2$). Hart and Wehrly compute the bias
and variance of kernel estimators, and show that, despite the
lack of consistency, considerable improvements (in terms of
mean squared error) can be made by using kernel estimators
with $\lambda > 0$.

A second consequence of this theorem is that, if $S_{\rho_n} \to
S_\rho$ as $n \to \infty$, kernel estimators behave as they would with
independent errors with a different variance term. This is
expressed in Corollary 1.1, below.

Corollary 1.1: Suppose $S_{\rho_n} \to S_\rho$. Let $z_i$ be the process
generated by

$$z_i = f(x_i) + u_i,$$

where the errors $u_i$ are independent with variance $\sigma^2(1 +
2S_\rho)$, and $z$ has the same mean function as $y$. Then, under
the conditions of Theorem 1, asymptotically, as $\lambda \to 0$ and
$n\lambda \to \infty$,

$$\frac{\mathrm{MSE}_y(x_{n,i},\lambda,n)}{\mathrm{MSE}_z(x_{n,i},\lambda,n)} \to 1.$$

As a consequence, the asymptotically optimal
bandwidth, $\lambda(y)$, for estimating $f$ from $y$ is the same as for
estimating $f$ from $z$. Let $Z_i$ be the process generated by

$$Z_i = f(x_i) + v_i,$$

where the errors $v_i$ are independent with variance $\sigma^2$ and
let $\lambda(Z)$ be the asymptotically optimal bandwidth for
estimating $f$ from $Z$. If $S_\rho > 0$, then $\lambda(y) > \lambda(Z)$ and
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Figure  3:  Smoothed  eshmalo  and  moan  tunclion  for  y=cos(3.  tSnxjer  Minimum  lal.illod 
squared error was used to pick the smoothing parameter.
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prediction  error  (ESPE)  is 


Corollary  1  also  shows  that  the  results  of  Gasser  and 
Muller  (1979)  about  the  shapes  of  optimal  kernels  continue 
to  hold  when  the  errors  are  correlated. 
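The variance inflation in Corollary 1.1 can be illustrated numerically. The sketch below compares the exact variance of a kernel estimate under AR(1) errors with the variance under independent errors whose variance is $\sigma^2(1 + 2S_\rho)$. It assumes that $S_\rho$ is the sum of the correlations over positive lags, so that $S_\rho = \rho/(1-\rho)$ for an AR(1) process; the uniform kernel and the values of n, $\lambda$ and $\rho$ are likewise choices made for the example.

```python
import numpy as np

def uniform_kernel(u):
    # Uniform kernel on (-1/2, 1/2); an illustrative choice.
    return np.where(np.abs(u) < 0.5, 1.0, 0.0)

n, lam, rho, sigma2 = 1000, 0.1, 0.5, 1.0
design = np.arange(n) / n

# Priestley-Chao weights for estimating the mean at x = 0.5.
w = uniform_kernel((0.5 - design) / lam) / (n * lam)

# AR(1) correlation matrix rho^|i - j|.
lags = np.abs(np.subtract.outer(np.arange(n), np.arange(n)))
R = rho ** lags

# Exact variance of the estimate with correlated errors.
var_correlated = sigma2 * (w @ R @ w)

# Variance with independent errors of variance sigma^2 (1 + 2 S_rho),
# assuming S_rho = sum over k >= 1 of rho^k = rho / (1 - rho).
s_rho = rho / (1.0 - rho)
var_equivalent = sigma2 * (1.0 + 2.0 * s_rho) * (w @ w)

ratio = float(var_correlated / var_equivalent)
```

With about 100 design points in the kernel window the ratio is already close to one, and it approaches one as $n\lambda$ grows.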

3.  Selecting  a  Smoothing  Parameter 

For a given, finite set of observations, choice of an effective
smoothing parameter is of considerable interest. A
"good" value of the smoothing parameter will result in a small
value of $\mathrm{MSE}(x_{n,i}, \lambda, n)$.

Several criteria based on the data have been used for
bandwidth selection. Those most commonly used are CV
(Allen, 1974, Stone, 1974 and Geisser, 1975), GCV (Craven
and Wahba, 1979), and Mallows' $C_L$ (Mallows, 1973). The
properties of these criteria, including convergence of the
smoothing parameter chosen by one of the selection criteria
to the truly optimal value, and the asymptotic equivalence
of the criteria, have been explored in some detail for the
independent case. As Figure 2 demonstrates, these criteria
perform well when the errors are independent, but perform
very poorly when the errors are correlated.

CV is based on the "deleted" residuals, $y_{n,i} - \hat f_{\lambda,n;i}(x_{n,i})$,
where $\hat f_{\lambda,n;i}(x)$ is the estimator which does not use $y_{n,i}$.
Figure 6 provides a heuristic argument for the failure of CV when
the errors are positively correlated. In this case, the errors
for data near $x_{n,i}$ tend to have the same sign as the error
of $y_{n,i}$. As a result, $\hat f_{\lambda,n;i}(x)$ lies closer to $y_{n,i}$ than $f(x)$ and
the "deleted" residual is too small as an estimator of the true
error. As a result, CV underestimates the variance of the
estimator, and tends to pick bandwidths which are too small.
As Theorem 2 demonstrates, the converse is also true. If the
correlations are negative, CV overestimates the variance, and
tends to pick bandwidths which are too large.

Mallows' $C_L$, CV and GCV can all be viewed as estimators
of squared prediction error, based on a correction to the
observed squared residual. The prediction error at a point,
$x_{n,i}$, is the difference between a putative new realization of
the process, and the smooth based on the actual observations.
The errors in the new observations are independent of
errors in the original observations, so the expected squared
prediction error (ESPE) is

$$\mathrm{ESPE}(x_{n,i},\lambda,n) = E\left(y_{new}(x_{n,i}) - \hat f_{\lambda,n}(x_{n,i})\right)^2 = \sigma^2 + \mathrm{MSE}(x_{n,i},\lambda,n).$$

The residual at $x_{n,i}$,

$$r(x_{n,i},\lambda,n) = y_{n,i} - \hat f_{\lambda,n}(x_{n,i}),$$

is a natural estimator of prediction error, but the squared
residual is biased as an estimator of $\mathrm{ESPE}(x_{n,i},\lambda,n)$ as it
has expectation

$$E\left(r^2(x_{n,i},\lambda,n)\right) = \sigma^2 + \mathrm{MSE}(x_{n,i},\lambda,n) - 2\sigma^2 w_\lambda(x_{n,i}, i) - V_2(x_{n,i},\lambda,n). \qquad (4)$$

The term $2\sigma^2 w_\lambda(x_{n,i}, i)$ arises because $y_{n,i}$ is both a
term in the estimator, $\hat f(x_{n,i})$, and the estimator of $y_{new}(x_{n,i})$.
The additional variance term,

$$V_2(x_{n,i},\lambda,n) = 2\sigma^2 \sum_{j \ne 0} w_\lambda(x_{n,i}, i + j)\,\rho_n(j),$$

arises because of the correlation between $e_{n,i}$ and the other
errors.
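The expectation of the squared residual can be verified exactly for a linear smoother. The sketch below computes $E(r^2)$ directly from the residual operator $I - W$ and compares it with $\sigma^2 + \mathrm{MSE} - 2\sigma^2 w_\lambda(x_{n,i}, i) - V_2$. The quadratic kernel, the AR(1) correlation matrix and the parameter values are assumptions made for the illustration.

```python
import numpy as np

def quad_kernel(u):
    # Assumed kernel on (-1/2, 1/2), used only for illustration.
    return np.where(np.abs(u) < 0.5, 1.5 * (1.0 - 4.0 * u**2), 0.0)

n, lam, rho, sigma2 = 128, 0.2, 0.5, 1.0
design = np.arange(n) / n
f = np.cos(3.15 * np.pi * design)

# Smoother matrix: row i holds the weights used at design point i.
W = quad_kernel((design[:, None] - design[None, :]) / lam) / (n * lam)
lags = np.abs(np.subtract.outer(np.arange(n), np.arange(n)))
R = rho ** lags                      # AR(1) correlation matrix (assumed)
I = np.eye(n)

# Direct computation: r = (I - W) y, so
# E r_i^2 = ((I - W) f)_i^2 + sigma^2 [(I - W) R (I - W)']_ii.
A = I - W
direct = (A @ f) ** 2 + sigma2 * np.einsum("ij,jk,ik->i", A, R, A)

# The same quantity assembled term by term.
bias = W @ f - f
mse = bias**2 + sigma2 * np.einsum("ij,jk,ik->i", W, R, W)
self_term = 2.0 * sigma2 * np.diag(W)                # 2 sigma^2 w(x_i, i)
v2 = 2.0 * sigma2 * np.sum(W * (R - I), axis=1)      # sum over lags j != 0
formula = sigma2 + mse - self_term - v2
```

The two computations agree at every design point, including those near the boundary, because the identity is purely algebraic.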

$C_L$, CV and GCV can all be viewed as adjustments to
the squared residual which correct for $2\sigma^2 w_\lambda(x_{n,i}, i)$.
Mallows' $C_L$ is defined by

$$r_{C_L}(x_{n,i},\lambda,n) = r^2(x_{n,i},\lambda,n) + 2\hat\sigma^2 w_\lambda(x_{n,i}, i),$$

where $\hat\sigma^2$ is some unbiased estimator of $\sigma^2$. (In Mallows'
original paper, the criterion is divided by $\sigma^2$.) For bandwidth
selection, the criterion is usually totalled over all the
design points and the value of the smoothing parameter which
minimizes this sum is selected. However, the theoretical
computations in this section are done pointwise.

When the errors are independent, $V_2(x_{n,i},\lambda,n) = 0$ and
$r_{C_L}$ is unbiased for ESPE. However, $r_{C_L}(x_{n,i},\lambda,n)$ can be
badly biased for $\mathrm{ESPE}(x_{n,i},\lambda,n)$ if $V_2(x_{n,i},\lambda,n)$ is large.

Mallows' $C_L$ is inconvenient to use for bandwidth selection,
particularly when the errors are correlated, because of
the difficulty in finding good estimators of $\sigma^2$. CV and GCV
are adjustments to the residual sum of squares which have


Figure 4: Variance term, $V_1$, for the uniform kernel
with AR(1) errors, various values of $\rho(1)$.
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the  same  asymptotic  expectation  as  and  do  not  require 
an  estimate  of  the  variance. 

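Algebraic manipulation shows that the CV criterion is

$$r_{CV}(x_{n,i},\lambda,n) = \frac{r^2(x_{n,i},\lambda,n)}{\left(1 - w_\lambda(x_{n,i}, i)\right)^2}.$$

GCV was proposed by Craven and Wahba (1979) as an adjustment
to cross-validation that is more nearly unbiased for
ESPE in the case of unequally spaced points, if the design
points are considered to be fixed. The GCV criterion is

$$r_{GCV}(x_{n,i},\lambda,n) = \frac{r^2(x_{n,i},\lambda,n)}{\left(1 - \frac{1}{n}\,\mathrm{tr}\,W_{\lambda,n}\right)^2},$$

where $W_{\lambda,n}$ is the matrix $[w_\lambda(x_{n,i}, j)]$ and
$\mathrm{tr}\,W_{\lambda,n} = \sum_{i=0}^{n} w_\lambda(x_{n,i}, i)$. If $\lambda$ is small,
$\frac{1}{n}\,\mathrm{tr}\,W_{\lambda,n} \approx w_\lambda(x_{n,i}, i)$, so CV and GCV differ very little.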
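A small sketch of the pointwise CV and GCV criteria for a Priestley-Chao smoother follows. The quadratic kernel, the simulated data and the function names are assumptions made for illustration. With equally spaced design points and unnormalized weights, every diagonal weight equals $K(0)/(n\lambda)$, so in this particular setting the two criteria coincide.

```python
import numpy as np

def quad_kernel(u):
    # Assumed kernel on (-1/2, 1/2), used only for illustration.
    return np.where(np.abs(u) < 0.5, 1.5 * (1.0 - 4.0 * u**2), 0.0)

def smoother_matrix(design, lam):
    # Row i holds the weights w_lambda(x_i, j).
    n = len(design)
    return quad_kernel((design[:, None] - design[None, :]) / lam) / (n * lam)

def cv_gcv_scores(y, design, lam):
    W = smoother_matrix(design, lam)
    resid = y - W @ y
    # CV: squared residual inflated by the point's own weight.
    cv = resid**2 / (1.0 - np.diag(W)) ** 2
    # GCV: the own weight is replaced by the average diagonal weight.
    gcv = resid**2 / (1.0 - np.trace(W) / len(y)) ** 2
    return cv, gcv

rng = np.random.default_rng(1)
n = 128
design = np.arange(n) / n
y = np.cos(3.15 * np.pi * design) + rng.standard_normal(n)
cv, gcv = cv_gcv_scores(y, design, lam=0.2)
total_cv, total_gcv = float(cv.sum()), float(gcv.sum())
```

For bandwidth selection the pointwise scores would be totalled over the design points and minimized over the bandwidth.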

For $\lambda \to 0$ and $n\lambda \to \infty$, $\mathrm{MSE}(x_{n,i},\lambda,n) = O(\frac{1}{n\lambda}) +
O(\lambda^{2p})$, while Lemma 2, below, shows that $V_2(x_{n,i},\lambda,n) =
O(\frac{1}{n\lambda})$. Using a Taylor series expansion for CV or GCV, the
expectation of the criteria are:

$$E\left(r_{(G)CV}(x_{n,i},\lambda,n)\right) = \sigma^2 + \mathrm{MSE}(x_{n,i},\lambda,n) - V_2(x_{n,i},\lambda,n) + o(\lambda^{2p}) + o\!\left(\frac{1}{n\lambda}\right). \qquad (5)$$

So, asymptotically, $C_L$, CV, and GCV have the same expectation
for equally spaced design points. Theorem 2 describes
the behavior of this expectation.


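Lemma 2: If the kernel function satisfies conditions A-C, and
the correlation function satisfies condition D, then for $\lambda/2 <
x_{n,i} < 1 - \lambda/2$,

$$V_2(x_{n,i},\lambda,n) = \frac{4\sigma^2 S_\rho K(0)}{n\lambda} + o\!\left(\frac{1}{n\lambda}\right).$$

Theorem 2: Under the conditions of Theorem 1, the variance
term for $C_L$, CV, and GCV is

$$V_1(x_{n,i},\lambda,n) - V_2(x_{n,i},\lambda,n) \approx \frac{\sigma^2 W_K}{n\lambda}\left(1 + 2S_\rho\left(1 - \frac{2K(0)}{W_K}\right)\right).$$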


The details of the proof of Lemma 2 are in Altman, 1988.
Theorem 2 follows simply. Theorem 2 is also true for suitable
boundary kernels.

Most kernel estimators commonly in use have $K(0) \ge
K(x)$, so that $K(0) \ge W_K$. Let $\lambda^*(y)$ be the bandwidth
chosen by one of the selection criteria for estimating $f$ from
$y$, and $\lambda^*(Z)$ be the bandwidth chosen when estimating from
$Z$ of Corollary 1. (Recall that $Z$ is a process with the same
mean and variance as $y$ and independent errors.) Theorem 2
suggests that if $S_\rho < 0$, then $\lambda(y) < \lambda(Z) \approx \lambda^*(Z) < \lambda^*(y)$.
If $S_\rho > 0$, then $\lambda(y) > \lambda(Z) \approx \lambda^*(Z) > \lambda^*(y)$. In fact,
if $1 + 2S_\rho(1 - 2K(0)/W_K) < 0$, then the criteria tend to be strictly
increasing with $\lambda$, and so they will favor interpolation. This
is supported by the simulation results reported in Altman
1987 and 1988.

4.  Correcting  for  Correlation 

Theorem 2 establishes that bandwidth selectors perform
poorly because they do not fully correct the residual sum of
squares. In this section, two methods are suggested for
correcting the selection criteria when the correlation function is
known. The direct method adjusts the criteria to make them
more nearly unbiased for ESPE. The indirect method
transforms the residuals to produce transformed residuals which
are less correlated.

If the correlation function $\rho_n$ is known, with corresponding
correlation matrix $R_n$, Mallows' $C_L$ can be corrected to
be an unbiased estimator of ESPE.

From equation (4) an appropriate adjustment for Mallows'
$C_L$ criterion is

$$r_{C_L,\rho}(x_{n,i},\lambda,n) = r^2(x_{n,i},\lambda,n) + 2\hat\sigma^2 \sum_{j=-k}^{k} w_\lambda(x_{n,i}, i + j)\,\rho_n(j). \qquad (6)$$

The corresponding adjustments for CV and GCV are intended
to match the low order terms in the Taylor series
expansion in equation (5) to the adjusted $C_L$ criterion in
equation (6). One way to do this is to set

$$r_{CV,\rho}(x_{n,i},\lambda,n) = \frac{r^2(x_{n,i},\lambda,n)}{\left(1 - \sum_{j} w_\lambda(x_{n,i}, i + j)\,\rho_n(j)\right)^2}$$


Figure 5: Positively correlated errors require large bandwidths to average to zero, while
negatively correlated errors require only small ones.
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and 


''GCV.pC^n,!,  A,n) 


r^(xn,i,  A,n) 


We will call this the direct method of correcting for correlation,
and denote the corresponding bandwidth selection criteria
by $CV_\rho$ and $GCV_\rho$ respectively.
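The direct correction can be sketched for the totalled $C_L$ criterion, where the penalty added to the residual sum of squares becomes $2\sigma^2\,\mathrm{tr}(W R)$ instead of $2\sigma^2\,\mathrm{tr}(W)$. This follows from the expectation of the squared residual when the errors have correlation matrix $R$. In the sketch the variance is treated as known, the sum runs over all lags rather than a truncated range, and the quadratic kernel, AR(1) correlation and function names are assumptions made for illustration.

```python
import numpy as np

def quad_kernel(u):
    # Assumed kernel on (-1/2, 1/2), used only for illustration.
    return np.where(np.abs(u) < 0.5, 1.5 * (1.0 - 4.0 * u**2), 0.0)

def corrected_cl_total(y, W, R, sigma2):
    # Residual sum of squares plus a correlation-aware penalty.
    resid = y - W @ y
    return float(resid @ resid + 2.0 * sigma2 * np.trace(W @ R))

n, lam, sigma2 = 128, 0.2, 1.0
design = np.arange(n) / n
W = quad_kernel((design[:, None] - design[None, :]) / lam) / (n * lam)
lags = np.abs(np.subtract.outer(np.arange(n), np.arange(n)))

def penalty(rho):
    # Penalty term for AR(1) errors with lag-one correlation rho.
    return float(2.0 * sigma2 * np.trace(W @ (rho ** lags)))

pen_neg, pen_ind, pen_pos = penalty(-0.5), penalty(0.0), penalty(0.5)

rng = np.random.default_rng(2)
y = np.cos(3.15 * np.pi * design) + rng.standard_normal(n)
rss = float((y - W @ y) @ (y - W @ y))
cl_independent = corrected_cl_total(y, W, np.eye(n), sigma2)
```

With independent errors the corrected criterion reduces to the usual $C_L$. Positive correlation enlarges the penalty, which counteracts the tendency to undersmooth, and negative correlation shrinks it.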

Another approach to the problem when the correlation
matrix is known, is to compute the transformed residuals:
$r_{\rho^{-1}}(\cdot,\lambda,n) = R_n^{-1/2}\, r(\cdot,\lambda,n)$. This has been used with some
success in the context of spline smoothing with normal AR(1)
errors (Diggle 1985, Diggle and Hutchinson, 1985, and Engle
et al., 1986). The goodness of fit criterion is then the total
weighted MSE,

TSE,-.  (A,n)  =  -  /(•)) 

=  trB(.,  A, A, -h  <7^trWl,„  Wa,„ 

where  B(«,A,n)  is  the  vector  of  biases  defined  by  (3). 

The  totalled  criterion  based  on  the  transformed  residuals, 


-Er„'-Jxn.„A.n)  +  2.^^  tr  W,,„, 

1=0 

is then unbiased for the expected value of the total weighted MSE.

The CV and GCV criteria based on the transformed
residuals can also be readily defined. They are


.iiid 


rp-i(in.i|A,ii) 
( 1  - 


'■/.ri',,.  ' 


r!-i(ii..i,A,ti) 
^  ir  ■ 


We will call this the indirect method of correcting for
correlation, and denote the corresponding bandwidth selection

•rii'  rii!>>  .  I  .iii'l  i  r-.‘.'.|HM:tiv»dy. 

{’ !■ -ii  :-■[-< 'Tt -■■}  111  Aliiii'oi  r,JS7  afi'j 

Imi|;i  '  rif'Ti  »  uh^u  th'-  corr'd.o j"U 

lujJ'  H  -  'll  1'  k!l  11. 


5.  Estimating  the  Correlation  Function 


Usually the correlation function is unknown and must
be estimated from the data. Theorem 3, below, shows that
the method of moments (MM) estimator of $\rho_n(\cdot)$ is consistent
under mild regularity conditions on the errors. The corollaries
explore the nature of the bias for finite samples.

Theorem 3: Suppose the mean function has a derivative
which is Lipschitz of order $\gamma$, and the kernel and correlation
functions satisfy the conditions of Theorem 1. For fixed $s$,
define the method of moments estimator of $\rho_n(s)$ by


p„(r,A)  = 


'•=[=^1 

Then, as $\lambda \to 0$ and $n\lambda \to \infty$,


L.-raii 


where


C(A,n)  =  l  +  A^'>  (5) 

(l+25p) 

riA 


f(fM(^]fdx 

(Wk  -  2K(0)). 


(7) 


Suppose  in  addition,  the  errors  are  fourth  order  stationary. 
£et  K.i,n(r,  s,  0)  be  the  fourth  joint  cumu/ant  of  the  distribution 
0/ e„  ,t+r,e,,j+,,e„,t  +  r+s)  assume  that,  for  u  suffi¬ 
ciently  large,  and  for  all  r  and  s,  ki.„(r,  s,0)|  <  00. 

Then 

V’ar(p„(l,A))  =  0(i) 

tlA 

111  any  tiiiil*.*  lii»*  eslinuitur  is  bia&ed.  SiiiCr  tlu*  iii- 

j'au  ahuijl  Jh'*  i. urrfiat j<jJj  j.>  ijj  i)jh  errors.  ii  js  U'U  sur 
priMiiki  llial  til'-  I'lav  ut.  wh'-n  th*-*  sii;nal  tn  iimis**  ratiu 

1''  Miitl!  W  ii».‘ii  th*  M-Aiiai  i'.'  U''!""  r-uio  is  larj;'*,  haiiJwidi  h 
1  Mil  '  .tn  r'-.tdil\  !>-•  .!■  >11'-  !>>  '->*•  i',sluiiat'-.s  >.'f  i  !i--  '"rr" 
lj>)'  ijs  ar*-  uh'-n  l)i'>  aiv  ni''st  . . I'-'l. 

t'oiiillnry  'LIi/h.-./M  (Lt  ivudiltotih  0/  'rhiouui  >. 

{‘■{U  cll'j, 

iij  !'ht  "//  il.\)  15  u  fnuihi.rt  t/f  (h>  t<  r  riit 

, 


Figure 6: When the errors are positively correlated, the
cross-validation estimator lies too close to the
data. As a result, the 'deleted' residuals are too
small, and CV is biased down.
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1)  If  2K(0)  <  Wk,  the  bias  of  is  positive. 

c) If $2K(0) > W_K$, $\hat\rho_n(1,\lambda)$ has bias which is increasing in
$\lambda$, and is decreasing in $S_\rho$.


Proof:

$$E(\hat\rho_n(1,\lambda)) \approx \frac{\rho_n(1) + C(\lambda,n)}{1 + C(\lambda,n)}$$

where $C(\lambda,n)$ is defined by (7). $E(\hat\rho_n(1,\lambda))$ is an increasing
function of $C(\lambda,n)$. If $2K(0) < W_K$, then $C(\lambda,n) > 0$. If
$2K(0) > W_K$, then $C(\lambda,n)$ is an increasing function of $\lambda$,
and of the signal to noise ratio and a decreasing function of
$S_\rho$.


From these computations, it is possible to compute the
bandwidth which is asymptotically optimal for minimizing
the bias of the MM estimator. For kernels which have a
maximum at K(0), this bandwidth is not the asymptotically
optimal bandwidth for estimating the mean function. This
leads to the tentative conclusion that techniques which
iteratively compute the mean and correlation function may not
converge to the true mean and correlation function. However,
bimodal kernels may have some promise for this situation.


In practice, choice of bandwidth does not appear to be
very sensitive to the estimated correlation (Altman, 1987 and
1988, Engle et al., 1986). In consequence, a simple two step
procedure often performs well. First, estimate the correlations
from the residuals from a moderate bandwidth smooth.
Then use the estimated correlations to pick a bandwidth for
estimating the mean function $f$.
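The two step procedure can be sketched as follows. Several pieces are assumptions made for this illustration and are not taken from the paper: the lag-s method of moments estimate is taken to be the ratio of the lagged cross-product of the residuals to their sum of squares, the variance is estimated crudely by the mean squared pilot residual, the bandwidth is chosen with a correlation-corrected $C_L$ criterion totalled over the central half of the design, and correlations beyond `max_lag` are set to zero.

```python
import numpy as np

def quad_kernel(u):
    # Assumed kernel on (-1/2, 1/2), used only for illustration.
    return np.where(np.abs(u) < 0.5, 1.5 * (1.0 - 4.0 * u**2), 0.0)

def smoother_matrix(design, lam):
    n = len(design)
    return quad_kernel((design[:, None] - design[None, :]) / lam) / (n * lam)

def mm_correlations(resid, max_lag):
    # Assumed moment estimator: sum r_i r_{i+s} / sum r_i^2 for s = 1..max_lag.
    denom = resid @ resid
    return np.array([resid[:-s] @ resid[s:] / denom
                     for s in range(1, max_lag + 1)])

def two_step_bandwidth(y, design, pilot_lam, grid, max_lag=5):
    n = len(y)
    # Step 1: residuals from a pilot smooth, away from the boundary.
    pilot_keep = (design >= pilot_lam / 2) & (design <= 1 - pilot_lam / 2)
    resid = (y - smoother_matrix(design, pilot_lam) @ y)[pilot_keep]
    rho_hat = mm_correlations(resid, max_lag)
    sigma2_hat = resid @ resid / len(resid)
    # Step 2: corrected criterion using the estimated correlations.
    lags = np.abs(np.subtract.outer(np.arange(n), np.arange(n)))
    rho_by_lag = np.concatenate(([1.0], rho_hat, np.zeros(n - max_lag - 1)))
    R_hat = rho_by_lag[lags]
    keep = (design >= 0.25) & (design <= 0.75)
    scores = []
    for lam in grid:
        W = smoother_matrix(design, lam)
        r = (y - W @ y)[keep]
        scores.append(r @ r + 2.0 * sigma2_hat * np.sum(W[keep] * R_hat[keep]))
    return float(grid[int(np.argmin(scores))]), rho_hat

# Simulated example: AR(1) errors with unit variance and rho = 0.5.
rng = np.random.default_rng(0)
n, rho = 1000, 0.5
design = np.arange(n) / n
shocks = rng.standard_normal(n)
errors = np.empty(n)
errors[0] = shocks[0]
for t in range(1, n):
    errors[t] = rho * errors[t - 1] + np.sqrt(1 - rho**2) * shocks[t]
y = np.cos(3.15 * np.pi * design) + errors
grid = np.linspace(0.05, 0.5, 10)
lam_hat, rho_hat = two_step_bandwidth(y, design, pilot_lam=0.1, grid=grid)
```

The pilot bandwidth plays the role of the "moderate bandwidth smooth" in the text. The estimated lag-one correlation is pulled slightly towards zero by the smoothing, which is the kind of finite-sample bias the corollaries describe.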


For comparison purposes, Table 1 shows an example of
the results from the simulation study in Altman, 1987. Fifty
realizations were taken of a sample of size 128 from the process
$y = \cos(3.15\pi x) + e$ where the errors $e$ were a Normal
AR(1) process with variance 1.0. The quadratic kernel
was used. The number under the heading
"min ASE" is the median over the 50 realizations of the
average squared error (ASE) loss at the bandwidth minimizing
ASE for that realization. The number under the heading
"min CV" is the median of the ratio of ASE at the bandwidth
selected by CV to the minimum ASE for that realization. The
numbers under "min $CV_\rho$" and "min $CV_{\hat\rho}$" are similar ratios
for the direct correction to CV with the true and estimated
values of the correlation function.


Table  1 


u 

mill  .ASF 

mill  CV  mill  CV,, 

min  CV/. 

It 

.135 

2  93 

1  12 

Ill 

li 

19.8 

1,19 

1  12 

1  97 

li 

.951 

1.99 

Tilt; 

1  11 

9917 

1  39 

2  13 

117 

Ih- 

r'-snll-  uei 

e  v.-ry  -mular 

f.  r  all  the 

kernels  iise.l 

111  the  sU.idy  Hesul's  wi  r.-  also  very  goorl  when  ihe  signal  to 
noise  r.il  lo  was  very  l  arge  (error  v  an  iii<  e  (II  )  eien  I  hough  I  lie 

estiiiiales  of  correlalion  ,i  eery  |  irsnioiiii' ••i|  upwards 

bl.is 
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ABSTRACT 

Let $\mu$ be a smooth function defined on an interval
[a,b], and suppose that $y_1, \ldots, y_n$ are uncorrelated
observations with $E(y_j) = \mu(t_j)$, $1 \le j \le n$, where the $t_j$
are fixed equally spaced points in [a,b]. Estimation of $\mu$
and its derivatives by regression on trigonometric and
low order polynomial terms is considered. The polynomial
terms are shown to adjust for the boundary bias
problems known to be suffered by regression on trigonometric
terms alone. As a result, the estimator of $\mu$ and
its derivatives obtained by this method is shown to be
competitive with other nonparametric estimators. The
method is illustrated by estimating two growth curves
and their derivatives.

1. INTRODUCTION

A problem that arises in nonparametric regression
analysis is the estimation of functionals of a regression
curve. One particularly important example is estimating
a derivative of some order. Thus in this paper
we investigate the properties of a simple new method
for derivative estimation. It will be shown that derivative
estimation by regression on a combination of
polynomial and trigonometric functions provides optimal
rates of convergence with respect to average mean
square error. These results extend work on function
estimation by Eubank and Speckman (1988) to derivative
estimation under the assumption of equally
spaced observations.

Assume that observations are taken according to the
model

$$y_i = \mu(t_i) + e_i, \qquad 1 \le i \le n,$$

where the $e_i$ are zero mean uncorrelated random vari-


A natural estimator of the mth derivative of $\mu$ can be
obtained by differentiating (1.1) m times. We will
label the result $\hat\mu_\lambda^{(m)}(t)$ and term it a polynomial-trigonometric
regression (PTR) estimator of $\mu^{(m)}(t)$. This
is the estimator to be studied in the present paper.

The proposed method is motivated by the following
observations. It is known (see Eubank, Hart, and
Speckman (1987), for example) that estimation by a
trigonometric series alone (i.e. with no polynomial part
in (1.1)) has optimal convergence properties if $\mu$ has d
derivatives with $\mu^{(m)}(0) = \mu^{(m)}(1)$, $0 \le m < d$. The
problem with the pure trigonometric series is the fact
that the fit is necessarily periodic while the true response
function $\mu$ need not be. This can result in serious
bias problems at the boundaries. However, suppose
that p(t) is the unique polynomial of degree d such
that

$$p^{(m)}(1) - p^{(m)}(0) = \mu^{(m)}(1) - \mu^{(m)}(0), \qquad 0 \le m < d, \qquad (1.2)$$

and let $\mu(t) = p(t) + \mu_0(t)$. Then $\mu_0$ has the requisite
boundary properties for good estimation by a trigonometric
series. Heuristically, the polynomial part of
PTR estimates p, and the trigonometric part effectively
models $\mu_0$. We make these observations precise
in the next sections.

The performance of the derivative estimator obtained
from (1.1) will be measured by the average mean square
error of estimation.


.\M.8E(Al 


i  = 


ables with constant variance $\sigma^2$. Further assume that
the $t_i$ are equally spaced, and without loss of generality
take $t_i = (i - 1)/n$, $1 \le i \le n$. An estimator for $\mu$ that
was proposed by Eubank and Speckman (1988) is


A 

li.l-'  +  E  (c,  COSg'Tkt 

•'  k  1 


f  Sj^sin'Jrrkt ). 


II. I) 


where  the 

OM'I  H  .,  ( ' 

.1 


b..  c,  .  and  s,  are  oblaini 
I  k  k 

1^.  .Old  S|^  I  he  (plant  it  y 


•d 


by  miiiimi/ing 


Our  principle  result  about  the  properties  of  is  tin' 

following. 

Till  OH  III.  If  /I  has  d  -  1  absolutely  eoniiiiuoiis 

deri\  ill  i  ves  with  '  C  l.“ll).  1  ]  and  II  <  m  <  d,  then  as  n 

‘  > 

■  .  and  A  -  ,i  in  sin  h  a  way  t  h.il  .\“/n  •  li. 

.AMSElA)  ()( A  ^  i  ()(A-"'  *  '/ni. 

In  pan Hajlar.  lor  A  --  ()( ii  ). 


ii 


<i  A 

A  li  t.'  -  E  IC,  (OS •jTkl.  i  S,  .-in-ATkl-ir. 

,1  1  '  '  k.  1  '  '' 

Here the terms d and $\lambda$ are "smoothing parameters"
to be chosen by the user. A workable strategy is to fix
d at a small value (typically 2 or 3) and allow $\lambda$ to
vary with n to obtain a suitable fit (cf. Eubank and
Speckman (1988)).
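A polynomial-trigonometric regression fit and its derivative can be sketched with ordinary least squares. Because the index ranges in (1.1) are not legible in this copy, the sketch assumes the regressors are 1, t, ..., t^d together with cos(2πkt) and sin(2πkt) for k = 1, ..., λ. The function names and the test function are illustrative only.

```python
import numpy as np

def ptr_design(t, d, lam, deriv=0):
    # Columns are the deriv-th derivatives of the assumed basis:
    # 1, t, ..., t^d, then cos(2 pi k t), sin(2 pi k t) for k = 1..lam.
    t = np.asarray(t, dtype=float)
    cols = []
    for j in range(d + 1):
        if j < deriv:
            cols.append(np.zeros_like(t))
        else:
            # Falling factorial j (j - 1) ... (j - deriv + 1); equals 1 if deriv == 0.
            coef = float(np.prod(np.arange(j - deriv + 1, j + 1))) if deriv > 0 else 1.0
            cols.append(coef * t ** (j - deriv))
    for k in range(1, lam + 1):
        omega = 2.0 * np.pi * k
        # Each derivative multiplies by omega and shifts the phase by pi/2.
        cols.append(omega**deriv * np.cos(omega * t + deriv * np.pi / 2))
        cols.append(omega**deriv * np.sin(omega * t + deriv * np.pi / 2))
    return np.column_stack(cols)

def ptr_fit(t, y, d, lam):
    # Ordinary least squares on the polynomial-trigonometric basis.
    coef, *_ = np.linalg.lstsq(ptr_design(t, d, lam), y, rcond=None)
    return coef

def ptr_predict(t, coef, d, lam, deriv=0):
    return ptr_design(t, d, lam, deriv) @ coef

n = 101
t = np.arange(n) / n                       # equally spaced points (i - 1)/n
mu = t**2 + np.sin(2.0 * np.pi * t)        # non-periodic test function
coef = ptr_fit(t, mu, d=2, lam=1)
fit = ptr_predict(t, coef, d=2, lam=1)
dfit = ptr_predict(t, coef, d=2, lam=1, deriv=1)
true_deriv = 2.0 * t + 2.0 * np.pi * np.cos(2.0 * np.pi * t)
```

The test function is not periodic, so a purely trigonometric fit would struggle at the ends of the interval. With the quadratic term included, this noise-free example is reproduced exactly, and so is its first derivative.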


.AMSElV.i  0(11  ""'''-''i  '  ''i. 

h'liiiiiik-.  When  A  OliA  ^'i.  I  he  re.-ullmg 

i.iieofde  ly  of  .A .M SI  .1  \  I  is  1  he  "opt  ii i i.il"  uniform  rale 
of  ( on\ eri'eiiee  foi  deiiv,ili\e  esiinialion  in  the  (  l.iSs  of 

linn  lions  C  L.jO.l]  as  sliown  li\  Slone  (IUSl’i 

Ellis  r.ile  Is  aihii'M'd  by  ,1  gie.il  m.ne.  lionp.ii  .mieli  ic 
e-l  im.llols. 

We believe that the assumption of equally spaced
points can be weakened substantially. In Eubank and
Speckman (1988), we were able to obtain the results of
the theorem for the case m = 0 assuming only that the
$t_j$'s were a sample from a distribution with positive
bounded density on [0,1]. We have been able to extend
the methods used there to obtain good bias bounds for
derivative estimation, but at present we are unable to
get a satisfactory estimate of the variance in the general
case. However, we conjecture that the hypothesis of
equally spaced points is not necessary for derivative
estimation.

In the next section we discuss some preliminaries,
establish further notation, and define a particular basis
for the polynomial part of the regression. The proof of
the main result is then sketched in Section 3. An
example is presented in Section 4.

2. PRELIMINARIES

To begin, note that $\hat\mu_\lambda$ may be obtained by using
OLS to fit

$$\mu_\lambda(t) = \sum_{j=1}^{d} b_j p_j(t) + \sum_{k=-\lambda}^{\lambda} a_k \exp\{2\pi i k t\}, \qquad (2.1)$$

where $\{p_j,\ j = 1, \ldots, d\}$ is any linearly independent
system of polynomials spanning $\{1, \ldots, t^d\}$. This
representation is useful because it is easier to work with
analytically, and we will freely use the fact that the left
hand side of (2.1) is real. A particularly convenient
basis for the polynomial term is obtained by defining
$p_j(t)$, $1 \le j \le d$, to be the unique polynomial of degree j
satisfying


conjugate  transpose.)  Now  let  pj  =  {|)j(tj),-  ■ 

and  define  p^^^^  =  A-*  ~  (1  -  (1  reason 

for  the  normalization  by  A^  will  become  clear  in 

Section  3.)  'fhe  pj^j^  are  orthogonal  to  the  Xj^.  With 

=  {P]„\-'  ■  ■*P(]n\J'  tnatrix  with  columns 

it  follows  that  T^^  and  = 

— 1  *• 

X.,  (X.,  X.,  )  X.,  are  orthogonal,  and  t  he  solution  to 

2n'  2n  2n'  2n  r- 

the  least  squares  problem  giving  i2.1)  can  be  expressed 

as 


/eA  =  (/'A('l)'----''A(’n)’'  =  ‘'‘nA  +  ‘’iiA'^' 

To  examine  the  behavior  of  the  derivative  estimate, 
we  need  a  representation  of  the  fiiintion  defined 

on  [O.lJ.  Let  a=  (a_j^.- •  •  .a^)^  =  ir'X||^y.  and  define 
r^^^y(t)  .=  ^]i<|<\  a|,c'-'<|)(2Tikt }.  Eor  a  function  f(i) 

on  [O.lj.  the  projection  T^^  ^f(i )  will  be  defined  by  taking 

y'  =  (f(t|).- • -.fltij)).  With  this  notation.  i).||^(t)  = 

A-'  “  ’^-’(1  -  ’',,\)Pi(')'  (^<''<'".5  =  (b|.- • -.b^l)’  = 

(X.,|X.^ii)“'x.^iiy.  w('  thus  obtain 

d 

p^lt).  ^j^lVjnA'"- 

Itepeated  different iai ioii  then  gives 


$$\int_0^1 p_j(t)\,dt = 0,$$

$$p_j^{(m)}(1) = p_j^{(m)}(0), \quad 0 \le m \le j-2 \ \text{for } j \ge 2,$$

$$p_j^{(j-1)}(1) - p_j^{(j-1)}(0) = 1.$$

It can be shown that $p_j(t) = B_j(t)/j!$, where $B_j(t)$ is the $j$th Bernoulli polynomial.
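The defining conditions of this basis can be checked directly. The following is a minimal sketch, assuming the standard Bernoulli numbers with $B_1 = -1/2$; the helper names (`basis_poly`, `jump`, and so on) are illustrative and not part of the paper. It builds $p_j = B_j/j!$ in exact rational arithmetic and evaluates the integral and the boundary jumps.

```python
from fractions import Fraction
from math import comb, factorial

def bernoulli_numbers(n):
    """Bernoulli numbers B_0..B_n (convention B_1 = -1/2)."""
    B = [Fraction(0)] * (n + 1)
    B[0] = Fraction(1)
    for m in range(1, n + 1):
        # Recurrence: sum_{k=0}^{m} C(m+1, k) B_k = 0 for m >= 1.
        B[m] = -sum(comb(m + 1, k) * B[k] for k in range(m)) / (m + 1)
    return B

def basis_poly(j):
    """Coefficients (ascending powers of t) of p_j(t) = B_j(t) / j!."""
    B = bernoulli_numbers(j)
    # B_j(t) = sum_k C(j, k) B_{j-k} t^k
    return [comb(j, k) * B[j - k] / factorial(j) for k in range(j + 1)]

def derivative(c):
    """Coefficients of the derivative of a polynomial."""
    return [k * c[k] for k in range(1, len(c))]

def evaluate(c, t):
    """Value of the polynomial with coefficients c at t."""
    return sum(ck * Fraction(t) ** k for k, ck in enumerate(c))

def integral01(c):
    """Integral of the polynomial over [0, 1]."""
    return sum(ck / (k + 1) for k, ck in enumerate(c))

def jump(c, m):
    """p^(m)(1) - p^(m)(0) for the polynomial with coefficients c."""
    for _ in range(m):
        c = derivative(c)
    return evaluate(c, 1) - evaluate(c, 0)
```

For each $j$, `integral01(basis_poly(j))` should vanish, `jump` should vanish for orders $0, \ldots, j-2$, and the jump of order $j-1$ should equal one.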

For simplicity, we will assume that n is odd. (The case of n even is similar.) Then the vectors $x_k = (1, \exp\{2\pi i k/n\}, \ldots, \exp\{2\pi i k(n-1)/n\})'$, k


n  A  ,  V  es|i{ -2rik( r  l)/ii).  If  we  let  T  .v 
I  1-  r  '  '  'I  iiA- 

(lie  iirojeition  of  y  onto  the.  pan  o(  (xj^:  'k!  •' 
ort hogmialil y  of  the  X|^  implies  i  na t 


-  I). 

for  ik". 
^'k 

denote 
\\.  the 


f 

n 


A^ 


'X,  X,  y. 
In  lie 


(2,Tik)"'a.  e.\p{2-ikl  ) 

|k|<A 

Now  suppose  p  is  till'  poly  nomial  of  degree  d  such 
that  (1.2)  hohls,  liecanse  1  -  (  I  ^  ^  1’,!^'  as  an  oper- 

•  iloi  on  I, .,'(1.1]  alinihihiles  polynomials  of  degree  d, 

/III,'  -  l'.(/'\(ill  -  Pdi  1  1,1^  ‘  '’n\"'‘" 

„„(i)  i  11(11  (1,,^  I  ■  P"" 

V"  -  '  'n.,\  '  '’nf'V"' 

I ii»t ill”  liii-  1"!  Mf  i'i.i"  tii  liiiif' 


r  I 

‘  I  1-J/Mi  Ml 

di ' 


'/'yll  ilj 


/'ll 


III  n  \ 


whele  X,  {i  xp(2'ikl  l(.  ,  ,1  ,  I-  a  i(omple\ 

In  '  I  I  I  I '  I  ■  n;  A  k'  \  ' 

\ahieil,iii  -(2A  (  linialii:-:.  iX,  ileiiol  e-.  t  he  <  omolev 

I  n  ' 


The representation shows that the behavior of the bias does not depend on the polynomial piece, and the device of adding polynomial terms to the trigonometric regression frees the behavior of the bias from periodic boundary conditions.




Finally,  to  get  a  matrix  representation  for  = 

And 

"^•211  "  {Pj'nA(^-)h<r<ii;l<j<d- 
Equations (2.2) and (2.3) then become

=  "''''l„Xl,'>'  +  E>nlX,,>2„r'X2;,y 

(2..1) 

and 


It can be shown (see Eubank (1988)) that

$$\alpha_{kn} = \sum_{s=-\infty}^{\infty} \alpha_{k+ns}.$$

The proof of the main theorem rests on the following estimate for the convergence of the discrete Fourier coefficient $\alpha_{kn}$. The proofs will be detailed elsewhere.
Lemma 1. Under the assumptions of the Theorem,

^‘kti  ~  ^'k  ^  I''!  -  "/-•  (■*•  * 

The "big oh" term holds uniformly in n over the
specified range of k.


-ti“‘Y  X  \ 

M  -  /tp  II  ’'lii^lll/'O 

-Yu„{X,;X,„)-'x,>,,  (2.5, 

3.  ASYMPTOTIC RESULTS
The proof of the theorem depends heavily on the Fourier series representation of $\mu_0$. Notation and development here are adapted from Eubank (1988). Let $\alpha_k$ be the $k$th Fourier coefficient of $\mu_0$ defined as

$$\alpha_k = \int_0^1 \exp\{-2\pi i k t\}\,\mu_0(t)\,dt.$$

Then $\mu_0$ has the series representation

$$\mu_0(t) = \sum_{k=-\infty}^{\infty} \alpha_k \exp\{2\pi i k t\}.$$

Both sides of this expression can be differentiated (formally) $m$ times to obtain the series expansion

$$\mu_0^{(m)}(t) = \sum_{k=-\infty}^{\infty} (2\pi i k)^m \alpha_k \exp\{2\pi i k t\}. \qquad (3.1)$$


Ibis  shows  that  the  kth  fourier  coi'ffii  ieni  of  i.-' 


2,Tik)'"o|^.  for  I)  <  m  <  d.  the  ,i.'.'Uiiii)l loti  t  1,, 


implie.,  |)oinlwise  convergence  in  (3.1,  lexicpt  at  II  and 
I  ,.  I’a  r.^eval'.^  eqiialil  \'  gives 


f  /'|j"'*U)~dl  -  ')  (2-k,‘'"'|o|^r- 9  '  in  '  il- 


(3.2  I 


I  he  kill  l  oili  ier  <  oi'ffii  i.'iil  loi  //  in  K'  i,-  delined  lo 


n  /(  (i/n|expj  oriki/n).  l-i.-i.i 


l.iinnuiJ.  IfA'/n-'O. 


-1  t 

n  P  \P  1 
'  iiiiA'  vnA 


•  w  I  iw  ,,(Ue:ivl/2 
2(  U4  V-1  ,(  -II  '  .  II  *  \  even. 


u  •  \  Olid  . 


and 


-1  ( III  I  .2  ,,  ,  2iii , 

n  iqe„  ^:|  e.  (),A  | 


.(  -.1 


1  o  |>roceed  lo  I  he  proof  of  l  he  main  l  hiHUein.  w  i  iie 


.WISEi 


■  .  ,  - 1  v  1  ( m  I  .  I  ■  1  III  I, .  -2 

,1  A  1  =  II  \  I//  II;!-  f  ,/  1  !  I  ;  ! 


i-  I 


'  I ' 


l'\ 


,11 

-I  V  \  •  ,  I  111  I  , 

+  n  h  \  ar(//,  it.ii. 

i .  ,  '  ' 


Erom  (2.'i),  we  can  decompose  ihe  hi.i.-  iiilo  l  wo  com  - 

-  „'"U 


poiients.  h|  ^  - h  Y||^X||^/t|l  and  h._,^ 


--  Y.,  (X.,  X.,  )  X.,  U...  file  Slllllllied  squared  bias  ill 

2n  2n  2n  2nM,  ' 


.ANISE,  is  ihen  ilh|  y!|“  t  lih.,yi|"  +  2h|^h.,y  lo 

ohiain  the  de.-'ired  rale  of  convergeni  e  for  ihe  bias,  il 

•)  •) 

>iiffi(es  lo  show  that  l|h|  ^11“  and  ||h.,^l"  are  hoi  h 
OiiiA 

We  begin  with  h 


'I  A 


(h,  ^^(1 hi  y  I II 1 1  .  Hv 


(3.3,  and  the  oi  t  hogonalil  y  of  the  j  exjij  2  rrikr/ii } ) , 


'nV'tl'"  """  ''>^11 


n  ('|,i||"''(r/ii)exp{-2:nkr/ii).  we  obtain  for  l  ( 


(i  ■  -  .1  I 

*  1  tt  ' 


'-It"’  '-ikisdi  l,/2'‘kn"''^l'l-'^'l 


.,.:.(2-iki"'o,  exp|2-ikil 


and 


I,.  ,2 


(  III  I 


The  notation  S  in  the  last  line  above  denotes  sum- 
* 

mation  over  the  range  A  <  jk]  <  (n  -  l)/'2.  From  (A.  l) 
applied  to  and  the  fact  that  has  kth  Fourier 

coefficient  (27rik)”*«l^,  wo  have  =  {27rik)”’a|,  H- 

~  Hence  using  (3.4)  in  (3.(i),  we  obtain 

the  bounds 

,r‘||b.J|-=  ^  |(2;rik)"'o^  +  0(ir('‘“"'b 

|k|<A 

-(27rik)"’(o^  0(n~'‘))|- 

-h 'J|(27rik)"V.,.  -t  0(ir*'*~’'’hl'- 


1  he  first  sum  on  the  right  is  botttidi'd  bv 

()(An---('‘  -  +  0(A“"'  -  =  o(A--<''  -  ■">) 

a.''  A“/n  -  0.  The  setotid  siitn  is  ixninded  by 

2mi .  i2  ,  ,^,.,-2(d  -  III)  -t-  I  , 


2V^.,j(2.k)-"'io,,r^  0(1. 

<  2(2rrA)”-‘'‘  ^ 


1  i2di  |2 

A<|k|(-''><l 
-2(d  -  III)  -t  I  , 


-‘r  ()(  II  y 

■> 

liy  (3.2).  r.sing  A“/n  -  0  again  shows  thai 
■  ..  .-2(d-m) 


if'llb 


I  A'' 


()(A 


(3.7) 


.\e\t,  e(|iialioii  (3. .7a)  iinphes  that 


.r’x,X 


(3.S) 


where  Cl  is  a  d  «  d  posiii\e  deriiiiie  iiiatri.x.  Csiiig  the 
I..,  matrix  norm  1|A||“  =  ■'“l>||x|l >(|  ll^’'||/i!xl|  and  the 

fact  that  l]A|i"  <  tr  A*A.  it  can  lie  shown  that 
II  '(|b.,  is  asyiii[)totically  bounded  by 

Hut  11  '|[(l  -  Tj^  ^)/j|jl|”  -  0(A  ■■'')  by  13,7)  with  m  - 
I).  and  n'‘tr(Y.,\.,j^)  =  OIA'-’"')  by  (3..7I,),  hen.  e 
'"N.  l  ids  completes  t he  i>ri)i>f 

for  t he  bi.is  term. 

l  o  estimate  the  vaiiam  e.  re.  all  thal  and  X.j^^ 

ar.' ort hogoiial  aiul  t hat  X|^^X|^^  -  iil.  i  hen  from  (2.1 ) 
and  (3.X). 


tr  Var(/i*^"'’)  i-  (I'n  '  1 1 (  Y ^ Y j  I 

t  Y._,,  HiCi  'i. 


I  hi'  s.'.ond  li'im  on  the  right  is  again  Of  A"'").  F.ii  th.' 
first  term,  not.’  that  ii  bas  (u.\  )  element 


—1 

II  (— 2)riu)"'(27riv)”‘  ^  cxp{2ji(u  -  v)r/n} 
r=0 

0  11  jf  V 

(2fu)“'”  u  =  V. 

('on.sequoiitly.  tHYj^^Yj^^)  ^  2  (27rk)"'"  ~ 

2(2;r)“"'A“"'  '/(‘-bn  +  1).  This  complet.'s  the  proof 

of  the  theorem. 

4.  A GROWTH CURVE APPLICATION

One application of nonparametric derivative estimation has been to the study of growth curves. The derivative of the growth curve, called velocity, is of special interest in analyzing growth spurts. An example using growth data supplied by Dr. L. Molinari on a boy and a girl is reported in Eubank (1988, pp. 1.76 ff, p. 1S6). Figure 1 shows plots of the raw data and the growth curve estimates using PTH. In this example, the parameters d and $\lambda$ were both chosen using Generalized Cross Validation (GCV). The procedure, discussed in Eubank and Speckman (1988), is a data-based estimate of the parameters d and $\lambda$ which would theoretically minimize AMSE. GCV selected d = 3 and $\lambda$ = 8 for the boy and d = 3 and $\lambda$ = 1 for the girl. The residuals for these fits are plotted in Figure 2.

The PTH derivative estimates are plotted in Figure 3. Because the method uses a projection, there may be a spurious local maximum in the estimate of the boy's velocity curve around 12 years. However, growth spurts roughly at ages 7 and 14 are clearly visible and appear to be "real". This analysis agrees with the results from kernel smoothing reported in Eubank (1988). The velocity estimate for the girl is similar, with two apparent growth spurts.

This analysis demonstrates the simplicity and usefulness of PTH. The PTH models can be fit with virtually any regression package. Thus good derivative estimates as in Figure 2 can be obtained with no specialized software. Because there are missing observations in both data sets, the assumption of equally spaced points does not hold, and the results of the theorem do not directly apply. However, we believe that the estimates obtained in these examples demonstrate that PTH can work in practice even for unequally spaced data.
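The remark that these models can be fit with ordinary regression software can be illustrated with a small sketch. It is not the authors' code: it assumes the real form of (2.1), with an intercept, the powers $t, \ldots, t^d$ in place of the Bernoulli basis (both span the same polynomial space once the constant is included), and cosine and sine terms for frequencies $1, \ldots, \lambda$. The function names are hypothetical, and the normal equations are solved by plain Gaussian elimination, which is adequate only for small $d$ and $\lambda$.

```python
import math

def design_row(t, d, lam):
    """Basis values at t: 1, t..t^d, then cos/sin(2*pi*k*t) for k = 1..lam."""
    row = [1.0] + [t ** j for j in range(1, d + 1)]
    for k in range(1, lam + 1):
        row += [math.cos(2 * math.pi * k * t), math.sin(2 * math.pi * k * t)]
    return row

def design_row_deriv(t, d, lam):
    """First derivative of each basis function at t."""
    row = [0.0] + [j * t ** (j - 1) for j in range(1, d + 1)]
    for k in range(1, lam + 1):
        w = 2 * math.pi * k
        row += [-w * math.sin(w * t), w * math.cos(w * t)]
    return row

def solve(a, b):
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def fit_ptr(ts, ys, d, lam):
    """OLS coefficients from the normal equations X'X beta = X'y."""
    X = [design_row(t, d, lam) for t in ts]
    p = len(X[0])
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * y for r, y in zip(X, ys)) for i in range(p)]
    return solve(xtx, xty)

def predict_deriv(beta, t, d, lam):
    """Derivative estimate: differentiate the fitted function term by term."""
    return sum(b * v for b, v in zip(beta, design_row_deriv(t, d, lam)))
```

Because the estimate is an ordinary least squares fit, the derivative ("velocity") estimate is obtained simply by differentiating the fitted basis functions; nothing in the sketch requires equally spaced design points.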
EFFICIENT ALGORITHMS FOR SMOOTHING SPLINE ESTIMATION OF
FUNCTIONS WITH OR WITHOUT DISCONTINUITIES

Jyh-Jen Horng Shiau, University of Missouri-Columbia


ABSTRACT 

Efficient  algorithms  are  developed  for  GCV  smooth¬ 
ing  spline  estimation  of  a  function  which  is  smooth 
except  for  some  "break  points"  where  discontinuities 
occur  either  in  the  function  itself  or  its  lower  order 
derivatives.  For  a  problem  with  n  observations,  these 
algorithms require $O(n)$ operations for the equally spaced knots case and $O(n^2)$ operations for the unequally spaced knots case. Similar algorithms are also
derived  for  ordinary  smoothing  splines,  that  is,  without 
discontinuities. 

KEY WORDS: Smoothing splines, partial splines,
discontinuities,  efficient  algorithms,  generalized  cross 
validation. 


1.  INTRODUCTION 

To solve the problem of estimating an unknown function which is smooth except for break points (or curves or surfaces) where discontinuities occur either in the function itself or its lower order derivatives, Shiau (1985) proposed a partial spline approach to extend the smoothing spline estimation method by augmenting it with jump functions to reflect the discontinuities. Shiau (1987) proposed some methods for inference on the magnitude of the jumps based on the mean square error of the estimate. Shiau, Wahba and Johnson (1986) employed the method to provide models to include specified discontinuities in otherwise smooth two or three dimensional objective analyses and demonstrated that the model is appropriate for including tropopause height information in temperature analysis. In this paper, we present some efficient algorithms for this particular problem in the univariate case which utilize a special structure of the covariance matrix of smoothing splines, and we show that the Generalized Cross Validation (GCV) method of choosing the smoothing parameter as well as the smoothing spline estimate can be achieved in $O(n)$ operations for equally spaced data and $O(n^2)$ operations for unequally spaced data, where n is the number of observations in the data.

Given noisy data $\{y_i,\ i = 1, 2, \ldots, n\}$ observed from an unknown function $g$ at $\{t_i,\ i = 1, 2, \ldots, n\}$ in the interval (0, 1], we consider the following nonparametric regression model:

$$y_i = g(t_i) + \epsilon_i, \quad i = 1, 2, \ldots, n, \qquad (1.1)$$

where the $\epsilon_i$'s are uncorrelated with mean zero and common variance $\sigma^2$. For ordinary spline smoothing, the estimate $g_\lambda$ is the minimizer of the following variational problem:

$$\min_{f \in H}\ \frac{1}{n}\sum_{i=1}^{n}(y_i - f(t_i))^2 + \lambda\int_0^1 (f^{(m)}(t))^2\,dt, \qquad (1.2)$$

where $H$ is the Sobolev space $W_2^m = \{f \mid f^{(\nu)}$ are absolutely continuous for $\nu = 1, 2, \ldots, m-1$ and $f^{(m)} \in L_2[0,1]\}$. The smoothing parameter $\lambda$ controls the tradeoff between "closeness" to the data as measured by the first term and "roughness" of the solution as measured by the second term. As is well known, the choice of the smoothing parameter is crucial to spline estimates. We will adopt the Generalized Cross Validation method which was introduced by Craven and Wahba (1979) to estimate $\lambda$, since the GCV method has been proven to provide a good estimate of $\lambda$ both theoretically and numerically. For theoretical results on the efficiency of the GCV estimate of $\lambda$, see Craven and Wahba (1979), Speckman (1982) and Li (1985, 1986).

For problems with discontinuities, by choosing H to be the Sobolev space $W_2^m$ augmented by a jump space consisting of some appropriate truncated polynomials with derivatives defined almost everywhere, the estimate $g_\lambda$ of g can be shown to be

$$g_\lambda(t) = \sum_{i=1}^{n} c_i\,\xi_i(t) + \sum_{j=1}^{m} d_j\,\phi_j(t) + \sum_{k=1}^{q} \theta_k\,\gamma_k(t), \qquad (1.3)$$

where the c's, d's and $\theta$'s are real numbers, and

$$\xi_i(t) = \int_0^1 \frac{(t-u)_+^{m-1}(t_i-u)_+^{m-1}}{((m-1)!)^2}\,du, \quad i = 1, \ldots, n, \qquad (1.4)$$

$$\phi_j(t) = \frac{t^{j-1}}{(j-1)!}, \quad j = 1, 2, \ldots, m, \qquad (1.5)$$

$$\gamma_k(t) = \frac{(t-a_k)_+^{r_k}}{r_k!}, \quad k = 1, 2, \ldots, q, \qquad (1.6)$$

with $(x)_+ = \max(x, 0)$. Note that the jump function $\gamma_k^{(r_k)}(t)$ is discontinuous at $a_k$. Furthermore, the break points $a_k$'s need not be distinct. See Shiau (1985) or Shiau (1987) for details.

Let $\Sigma$ be the n by n matrix with (i,j)-th entry $\xi_i(t_j)$, $T_\phi$ be the n by m matrix with (i,j)-th entry $\phi_j(t_i)$ and $T_\gamma$ be the n by q matrix with (i,k)-th entry $\gamma_k(t_i)$. Note that the polynomial basis $\{\phi_j\}$ and the truncated polynomial basis $\{\gamma_k\}$ are all in the null space of the smoothing functional $J(f) = \int_0^1 (f^{(m)}(t))^2\,dt$. Letting $T = [\,T_\phi\ \ T_\gamma\,]$ and $\beta = (d_1, \ldots, d_m, \theta_1, \ldots, \theta_q)'$, it can be shown that (1.2) is equivalent to

$$\min_{c,\beta}\ \frac{1}{n}\|y - (\Sigma c + T\beta)\|^2 + \lambda\,c'\Sigma c, \qquad (1.7)$$

which is equivalent to solving the following linear system of equations:

$$(\Sigma + n\lambda I)\,c = y - T\beta, \qquad T'c = 0. \qquad (1.8)$$

This is the same form as for ordinary smoothing splines. It is well known that the solution is unique provided that T is of full rank. Note that $g_\lambda$ is a linear estimate since it can be expressed by $g_\lambda = A(\lambda)y$, where
$g_\lambda = (g_\lambda(t_1), g_\lambda(t_2), \ldots, g_\lambda(t_n))'$ and

$$A(\lambda) = \Sigma M^{-1}\bigl(I - T(T'M^{-1}T)^{-1}T'M^{-1}\bigr) + T(T'M^{-1}T)^{-1}T'M^{-1}$$

with $M = \Sigma + n\lambda I$.

Numerous algorithms are available for ordinary smoothing splines and partial splines. Reinsch's algorithm (Reinsch, 1967) is an $O(n)$ algorithm for computing the spline estimate $g_\lambda$ if $\lambda$ is fixed. The difficulty of computing the GCV function

$$V(\lambda) = \frac{\frac{1}{n}\,\|(I - A(\lambda))y\|^2}{\left(\frac{1}{n}\,\mathrm{tr}(I - A(\lambda))\right)^2} \qquad (1.9)$$

lies in the computation of the trace in the denominator.
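To make the cost of (1.9) concrete, here is a minimal reference sketch that evaluates the GCV score directly from a dense influence matrix. It is only a baseline for comparison, not one of the fast algorithms of this paper; the function name `gcv_score` and the dense list-of-lists representation are assumptions made for illustration.

```python
def gcv_score(A, y):
    """GCV score (1/n)||(I - A)y||^2 / ((1/n) tr(I - A))^2 for a dense n x n A."""
    n = len(y)
    # Residual vector (I - A) y, formed row by row.
    resid = [yi - sum(a * yj for a, yj in zip(row, y)) for row, yi in zip(A, y)]
    rss = sum(r * r for r in resid)
    # Trace of I - A, the quantity that is expensive to obtain in general.
    trace = sum(1.0 - A[i][i] for i in range(n))
    return (rss / n) / (trace / n) ** 2
```

With a dense matrix the residual alone costs on the order of $n^2$ operations, and forming $A(\lambda)$ itself is more expensive still, which is why the structure exploited in the following sections matters.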


Wendelberger (1981) developed a numerical algorithm for obtaining the GCV estimate of $\lambda$ and spline estimates of functions of several variables. This algorithm is practical for moderate data set problems, but is not practical for large data set problems. The reason is that it involves an eigenvalue-eigenvector decomposition of an n-d by n-d matrix, where d is the fixed dimension of the null space of the smoothing functional. The complexity of the algorithm is $O(n^3)$ due to that costly decomposition.

Bates and Wahba (1983) suggested some methods to reduce the computing burden, including using basis functions (e.g. B-splines) of a subspace of smaller dimension or a truncated singular value decomposition to handle large data set problems. Recently, Bates et al. (1986) have developed a public domain software package called GCVPACK for computing smoothing splines and partial splines for the multivariate case. The procedures require $O(n^3)$ operations.

Utreras (1980) proposed an approximation to the trace of $A(\lambda)$ in the case of equally spaced data. This approximation requires $O(n)$ operations for its calculation, so the GCV can be obtained cheaply. Utreras (1981) considered the case of not necessarily equally spaced data and obtained an approximation that has an initial overhead of finding the lowest n eigenvalues of a 2n by 2n band matrix of bandwidth 5 (for m = 2) which requires $O(n^2)$ operations.

Based on the special structure of cubic splines, Silverman (1981) modified Utreras' approximation and developed a linear time procedure called "Asymptotic Generalized Cross-Validation" to obtain the smoothing parameter.

Elden (1984) modified the method of computing the GCV function. Instead of computing the singular value decomposition (SVD) of an n by p matrix, he used a bidiagonalization which in fact is the first part of a singular value decomposition. He then showed that starting out from the bidiagonal decomposition, the GCV function can be computed in $O(n)$ operations. He claimed that if n and p are close, then the computation of this algorithm usually requires less than one third of the work for the full SVD. However, the bidiagonalization for an n by p matrix still needs $O(np^2)$ operations.

Recently, for computing the GCV function for the
general regularization/smoothing problem, Gu et al.
(1988) developed an algorithm which is based on the
Householder tridiagonalization similar to Elden's
(1984) bidiagonalization. This speeds up the algorithm
used  in  GCVPACK  by  a  factor  of  6  for  n  large  (>  .^lOO). 


The source code of the software package implementing
this algorithm, called RKPACK, and the report on its
performance can be found in Gu (1988).

Note that these fast $O(n)$ algorithms involve some kind of approximation. They either approximate the solutions from a subspace of lower dimension, truncate some smaller eigenvalues, or approximate the trace of $A(\lambda)$. In the following sections, based on the special structure of the covariance matrix $\Sigma$, we propose efficient algorithms for the one dimensional case to compute spline estimates and $V(\lambda)$ for the functions with or without discontinuities.

Recently, Hutchinson and de Hoog (1985) developed a linear time procedure to compute the ordinary GCV smoothing spline based on Reinsch's algorithm for computing the trace of $A(\lambda)$ in the general, not necessarily equally spaced or uniformly weighted case. We note that their approach, although quite different from ours, is actually based on a very similar structure.

In Section 2, we describe the special structure of the
matrix $\Sigma$ which inspired the construction of algorithms.
In  Section  3,  a  linear  time  algorithm  for  smoothing 
splines  with  jumps  is  derived  for  the  equally  spaced 
knots  case;  also  a  quadratic  time  algorithm  is 
mentioned  for  the  unequally  spaced  knots  case.  A 
simpler  algorithm  for  ordinary  smoothing  splines,  i.e., 
without  jumps,  is  given  in  Section  4. 


2.  SPECIAL STRUCTURE OF $\Sigma_m$


Inspired by a manuscript of Wahba (1969) where a special structure of matrices to be inverted for Tchebychev splines in their most general form is exhibited, we observe that the covariance matrix $\Sigma_m$ (defined in Section 1) corresponding to m can be transformed to a symmetric (2m-1)-band matrix. This special structure of $\Sigma_m$ will be used to develop efficient algorithms in Section 3 as well as in Section 4. To describe the transformation, we first define an n-m by n matrix $\Delta_m$ which transforms $g = (g(t_1), g(t_2), \ldots, g(t_n))'$ to an (n-m)-vector corresponding to the m-th divided difference of g. Here we adopt the definition and notation in deBoor (1978). Denote the m-th divided difference of a function g at points $t_i, t_{i+1}, \ldots, t_{i+m}$ by $[t_i, \ldots, t_{i+m}]g$. Assume that the $t_i$'s are all distinct (the problem of repeated observations can be resolved by averaging repeated observations and assigning appropriate weights to data points), and let $\Delta_m$ be the (m+1)-band (n-m) by n matrix with (i,j)-th entry

$$(\Delta_m)_{ij} = \begin{cases} \displaystyle\prod_{k=i,\ k \ne j}^{i+m} (t_j - t_k)^{-1} & \text{for } i \le j \le i+m, \\ 0 & \text{otherwise.} \end{cases} \qquad (2.1)$$

Then

$$\Delta_m \begin{pmatrix} g(t_1) \\ g(t_2) \\ \vdots \\ g(t_n) \end{pmatrix} = \begin{pmatrix} [t_1, \ldots, t_{1+m}]g \\ [t_2, \ldots, t_{2+m}]g \\ \vdots \\ [t_{n-m}, \ldots, t_n]g \end{pmatrix}.$$
For example, letting m = 2, n = 5 and $t_i = i/5$, we have

$$\Delta_2 = \frac{25}{2}\begin{pmatrix} 1 & -2 & 1 & 0 & 0 \\ 0 & 1 & -2 & 1 & 0 \\ 0 & 0 & 1 & -2 & 1 \end{pmatrix}.$$
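The example can be reproduced with a few lines of exact arithmetic. The sketch below assumes the usual divided difference weights $1/\prod_{k \ne j}(t_j - t_k)$ from deBoor (1978); the function name is hypothetical.

```python
from fractions import Fraction

def divided_difference_matrix(ts, m):
    """(n-m) x n matrix whose i-th row expresses the m-th divided difference
    [t_i, ..., t_{i+m}] g as a combination of g(t_i), ..., g(t_{i+m})."""
    n = len(ts)
    rows = []
    for i in range(n - m):
        row = [Fraction(0)] * n
        for j in range(i, i + m + 1):
            denom = Fraction(1)
            for k in range(i, i + m + 1):
                if k != j:
                    denom *= ts[j] - ts[k]
            row[j] = 1 / denom  # weight of g(t_j) in the divided difference
        rows.append(row)
    return rows
```

For $t_i = i/5$ and m = 2 each row is $(25/2)(1, -2, 1)$ placed on the band, every row annihilates values of a polynomial of degree at most m-1, and applied to $t^2$ every row returns the leading coefficient 1.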


We remark here that Reinsch (1967, 1971) and Shiller (1981) also used this matrix in the spline context. Then the covariance matrix $\Sigma_m$ can be transformed into a (2m-1)-band matrix as stated in the following proposition.

Proposition 2.1. $\Delta_m \Sigma_m \Delta_m'$ is a symmetric (2m-1)-band matrix.

A proof of Proposition 2.1 is in the Appendix.
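Proposition 2.1 can be checked numerically for m = 2, where (2m-1) = 3 and the transformed matrix should be tridiagonal. The sketch below is an illustration under stated assumptions: it uses the closed form $\int_0^1 (s-u)_+(t-u)_+\,du = s^2 t/2 - s^3/6$ for $s \le t$ (obtained by integrating (1.4) with m = 2), and all function names are hypothetical.

```python
from fractions import Fraction

def sigma_entry(s, t):
    """Covariance entry for m = 2: integral over [0, 1] of (s-u)_+ (t-u)_+ du."""
    lo, hi = min(s, t), max(s, t)
    return lo * lo * hi / 2 - lo ** 3 / 6

def second_difference_rows(ts):
    """Rows of the (n-2) x n second divided difference matrix."""
    n = len(ts)
    rows = []
    for i in range(n - 2):
        a, b, c = ts[i], ts[i + 1], ts[i + 2]
        row = [Fraction(0)] * n
        row[i] = 1 / ((a - b) * (a - c))
        row[i + 1] = 1 / ((b - a) * (b - c))
        row[i + 2] = 1 / ((c - a) * (c - b))
        rows.append(row)
    return rows

def transformed_covariance(ts):
    """Dense Delta * Sigma * Delta' for the design points ts (m = 2)."""
    delta = second_difference_rows(ts)
    sigma = [[sigma_entry(s, t) for t in ts] for s in ts]
    # half = Delta * Sigma
    half = [[sum(d * sigma[k][j] for k, d in enumerate(row)) for j in range(len(ts))]
            for row in delta]
    # result = half * Delta'
    return [[sum(h * d for h, d in zip(hrow, drow)) for drow in delta] for hrow in half]
```

The check works for unequally spaced points as well, since the banded structure comes from the local support of the B-spline that each row of the divided difference matrix produces.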

We can expand $\Delta_m$ to an n by n invertible matrix $\tilde\Delta_m$ by adding m rows on the top of $\Delta_m$. We choose the following setup to make new rows consistent with the others in $\Delta_m$. Let $t_0 = 0$, $t_{-1} = -t_1$, $\ldots$,

and  define 


otherwise.


(2.2) 


Then $\tilde\Delta_m$ is a lower triangular matrix with nonzero diagonals, hence invertible.

Proposition 2.2. $\tilde\Delta_m \Sigma_m \tilde\Delta_m'$ is a symmetric (2m-1)-band matrix.

A proof of Proposition 2.2 is in the Appendix.

3.  NUMERICAL ALGORITHMS FOR SMOOTHING
SPLINES WITH JUMPS


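For simplicity, the subscript of $\tilde\Delta_m$ will be suppressed if no confusion can occur. Denote $\tilde c = (\tilde\Delta')^{-1}c$, $\tilde T = \tilde\Delta T$, $\tilde y = \tilde\Delta y$, and $\tilde M = \tilde\Delta(\Sigma + n\lambda I)\tilde\Delta' = \tilde\Delta M \tilde\Delta'$. Then the system of equations (1.8) is equivalent to

$$\tilde M\tilde c + \tilde T\beta = \tilde y, \qquad \tilde T'\tilde c = 0, \qquad (3.1)$$

and the solution can be expressed explicitly as

$$\beta = (\tilde T'\tilde M^{-1}\tilde T)^{-1}\tilde T'\tilde M^{-1}\tilde y, \qquad \tilde c = \tilde M^{-1}\bigl(I - \tilde T(\tilde T'\tilde M^{-1}\tilde T)^{-1}\tilde T'\tilde M^{-1}\bigr)\tilde y. \qquad (3.2)$$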

Note that by the construction of the matrix $\tilde\Delta$ and Proposition 2.2, $\tilde M$ is a symmetric (2m+1)-band matrix. Therefore the inverse of $\tilde M$ can be computed in $O(n^2)$ operations (e.g. Dongarra et al., 1979). Moreover, note also that $\tilde T = \tilde\Delta T$ is a very sparse matrix due to Lemma A.1. Therefore, $\beta$ can be computed without involving many entries of $\tilde M^{-1}$ for the equally spaced knots case.

We only consider the case of m = 2. In the following, we develop a procedure which computes each entry of $\tilde M^{-1}$ in constant time after a linear time overhead. This leads to the linear time algorithm for the equally spaced knots case.

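For $t_i = i/n$, $i = 1, 2, \ldots, n$, and $t_{-1} = -t_1$, $t_0 = 0$, by (2.3), $\tilde\Delta_2$ can be expressed explicitly as

$$\tilde\Delta_2 = \frac{n^2}{2}\begin{pmatrix} 1 & & & & 0 \\ -2 & 1 & & & \\ 1 & -2 & 1 & & \\ & \ddots & \ddots & \ddots & \\ 0 & & 1 & -2 & 1 \end{pmatrix} \qquad (3.3)$$

and

$$\tilde\Delta\tilde\Delta' = \frac{n^4}{4}\begin{pmatrix} 1 & -2 & 1 & & & 0 \\ -2 & 5 & -4 & 1 & & \\ 1 & -4 & 6 & -4 & 1 & \\ & \ddots & \ddots & \ddots & \ddots & \ddots \\ & & 1 & -4 & 6 & -4 \\ 0 & & & 1 & -4 & 6 \end{pmatrix}. \qquad (3.4)$$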

It can be shown that $\tilde M$ has the same 5-band structure as in (3.4) with $\tilde M_{11} = n(2+L)/24$, $\tilde M_{12} = n(1-2L)/24$, $\tilde M_{22} = n(4+5L)/24$, $\tilde M_{ii} = n(4+6L)/24$ for $i \ge 3$, $\tilde M_{i,i+1} = n(1-4L)/24$ for $i \ge 2$ and $\tilde M_{i,i+2} = nL/24$ for all $i \ge 1$, where $L = 6n^4\lambda$. Recall that the (j,i)-th entry of $\tilde M^{-1}$ can be computed as the ratio of the cofactor of the element $\tilde M_{ij}$ and the determinant of $\tilde M$. Utilizing the band pattern of the matrix $\tilde M$, we are able to compute each entry of $\tilde M^{-1}$ efficiently by the following procedure, which is described in a more general form.
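The stated entries can be verified for a small n by forming the transformed matrix densely in exact arithmetic. This is a sketch under assumptions: it takes the extended difference matrix to be $n^2/2$ times the lower triangular band with rows $(1)$, $(-2, 1)$, $(1, -2, 1), \ldots$, uses the closed form $s^2 t/2 - s^3/6$ ($s \le t$) for the m = 2 covariance entries, and takes $L = 6n^4\lambda$. The function name is hypothetical.

```python
from fractions import Fraction

def m_tilde(n, lam):
    """Dense Delta~ (Sigma + n*lam*I) Delta~' for m = 2 and t_i = i/n."""
    ts = [Fraction(i, n) for i in range(1, n + 1)]
    # M = Sigma + n*lam*I, with Sigma_ij = integral of (t_i-u)_+ (t_j-u)_+ over [0, 1].
    m = [[min(s, t) ** 2 * max(s, t) / 2 - min(s, t) ** 3 / 6 for t in ts] for s in ts]
    for i in range(n):
        m[i][i] += n * lam
    # Extended second-difference matrix (assumed form, see lead-in).
    scale = Fraction(n * n, 2)
    delta = [[Fraction(0)] * n for _ in range(n)]
    for i in range(n):
        delta[i][i] = scale
        if i >= 1:
            delta[i][i - 1] = -2 * scale
        if i >= 2:
            delta[i][i - 2] = scale
    half = [[sum(delta[i][k] * m[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]
    return [[sum(half[i][k] * delta[j][k] for k in range(n)) for j in range(n)]
            for i in range(n)]
```

The dense product costs $O(n^3)$ and is meant only as a check; in the algorithm itself the five bands are written down directly from the formulas.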

Let $A^k$ be a k by k symmetric 5-band matrix (of the same form as $\tilde M$) with $A_{11} = a$, $A_{12} = b$, $A_{i,i+2} = c$ for $1 \le i \le k-2$, $A_{22} = d$, $A_{i,i+1} = e$ for $2 \le i \le k-1$ and $A_{ii} = f$ for $3 \le i \le k$. Let $B^k$ be the k by k matrix of the lower block of $A^{k+2}$, that is, the 5-band symmetric matrix with $B_{ii} = f$ for $1 \le i \le k$, $B_{i,i+1} = e$ for $1 \le i \le k-1$ and $B_{i,i+2} = c$ for $1 \le i \le k-2$. Also define $\tilde A^k$ and $\tilde B^k$ to be the k by k matrices obtained by removing the last row and the j-th column of $A^{k+1}$ and $B^{k+1}$ respectively. Let $B^k(j)$ be the k-1 by k-1 matrix obtained by removing the first row and the j-th column of $B^k$. Denote by $\alpha_k$, $\tilde\alpha_k$, $\beta_k$, $\tilde\beta_k$ and $\beta_{k,j}$ the determinants of matrices $A^k$, $\tilde A^k$, $B^k$, $\tilde B^k$ and $B^k(j)$ respectively. We also need to compute $\gamma_k$, the determinant of the matrix $F^k$ ($= B^{k+1}(k+1)$), which is the k by k matrix obtained by deleting the first row and the last column of the matrix $B^{k+1}$.

Procedure 1. (Computing any entry of $(A^n)^{-1}$)

Step 1. Recursively compute $\{\alpha_k,\ k = 1, 2, \ldots, n\}$ and
$\{\tilde\alpha_k,\ k = 1, 2, \ldots, n-1\}$ as follows:
‘'k  =  f“k-l  +  «’"k--2  -  ^"■"k-3  + 

\  =  «l,_2  +  c"\_2 

with  initial  conditions 


1,  Oj  =  a.  a.j  =  ad  —  h",  a.^  =  dct 


a  b  c 
b  d  c 
c  e  f 


Qq  =  0,  Oj  =  b,  a.,  =  ac  -  l)c. 

Step 2. Recursively compute $\{\beta_k,\ k = 1, 2, \ldots, n\}$ and
$\{\tilde\beta_k,\ k = 1, 2, \ldots, n-1\}$ as follows:


%  =  f4_i  -  +  ce3^_,  -  fc“  +  c^;i^_, 

•3  •) 

'4  ~  ^  '^k-i  “  '4-'?  '4-'> 

with  initial  conditions 

^—3  ~  ~  '4  ~  * ' 

75_,=^o  =  o. 


Step 3. Recursively compute $\{\gamma_k,\ k = 1, 2, \ldots, n-1\}$ by

\  =  "  \-l  -  \-2  +  \-3  - 

with  initial  conditions 

$\gamma_{-3} = \gamma_{-2} = \gamma_{-1} = 0,\ \gamma_0 = 1.$

Step 4. To compute $(A^n)^{-1}_{ji}$ for $j \ge i$, we first compute
the cofactor of the element $A^n_{ij}$, Cof $A^n_{ij}$, as

+  (f-d)c;4_2j_2}-for,ir2. 

Cof  A^j  =  -  c-aj_,  for  2  <  i  <  n, 

ro(A!'j.(-i)i+)(  V, 

-f  Vi/4-i.H  +  '''’^V2^4-i-i,j-i-i  f- 

for  2  <  i  <  j  <  n, 

where  =  7j_p4_j  -  <Aj_2\-j  +  ‘■'4-T4-j-l' 

for  1  <  j  <  k  <  ti. 

Then $(A^n)^{-1}_{ji} = (\text{Cof } A^n_{ij})/\alpha_n$.

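The complexity of this procedure is easily seen to be $O(n)$ in step 1 through step 3, and computing one entry in step 4 is $O(1)$. Therefore, if we want to get the inverse of $\tilde M$, the best we can do from this procedure is $O(n^2)$ since $\tilde M^{-1}$ is in general a full matrix even if $\tilde M$ is banded. But $\tilde T$ is a very sparse matrix in this application. If m = 2, there are only 3 + 2q nonzero entries where q is the number of break points. Based on that, we shall show that $V(\lambda)$, c, and $\beta$ can be computed in $O(n)$ operations.

(1) The solution $\beta = (\tilde T'\tilde M^{-1}\tilde T)^{-1}\tilde T'\tilde M^{-1}\tilde y$ takes at most (3+2q)n entries of $\tilde M^{-1}$.

(2) Since $\tilde M$ is a band matrix, $\tilde M\tilde c = \tilde y - \tilde T\beta$ can be solved in linear time, e.g., see Dongarra et al. (1979). Then $\tilde c$ can be transformed back to c in linear time by $c = \tilde\Delta'\tilde c$.

(3) To show that the GCV function $V(\lambda)$ can be computed in linear time, we first note that

$$I - A(\lambda) = n\lambda\,M^{-1}\bigl(I - T(T'M^{-1}T)^{-1}T'M^{-1}\bigr) = n\lambda\,\tilde\Delta'\tilde M^{-1}\bigl(I - \tilde T(\tilde T'\tilde M^{-1}\tilde T)^{-1}\tilde T'\tilde M^{-1}\bigr)\tilde\Delta, \qquad (3.5)$$

and that the numerator and the denominator of $V(\lambda)$ are

$$\|(I - A(\lambda))y\|^2 = \|n\lambda\,\tilde\Delta'\tilde c\|^2 = \|n\lambda\,c\|^2, \qquad (3.6)$$

$$\mathrm{tr}(I - A(\lambda)) = n\lambda\bigl[\mathrm{tr}(\tilde\Delta'\tilde M^{-1}\tilde\Delta) - \mathrm{tr}\bigl(\tilde\Delta'\tilde M^{-1}\tilde T(\tilde T'\tilde M^{-1}\tilde T)^{-1}\tilde T'\tilde M^{-1}\tilde\Delta\bigr)\bigr]. \qquad (3.7)$$

To compute the trace of $\tilde\Delta'\tilde M^{-1}\tilde\Delta$ ( = tr($\tilde M^{-1}\tilde\Delta\tilde\Delta'$) ), we actually only need the central 2m+1 bands of $\tilde M^{-1}$ since only the central 2m+1 bands of $\tilde\Delta\tilde\Delta'$ are nonzero. Also the (m+q) by n matrix $\tilde T'\tilde M^{-1}\tilde\Delta$ can be computed in $O(n)$ operations again by the sparsity of $\tilde T$ and $\tilde\Delta$, which shows that the second trace term can also be obtained in linear time. Thus $V(\lambda)$ can be computed in $O(n)$ operations.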
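The linear-time solve in item (2) relies on a standard banded factorization. As an illustration only (the paper cites the LINPACK routines of Dongarra et al. (1979); the code below is not that library), here is a minimal banded Cholesky solver for a symmetric positive definite matrix with half-bandwidth p, which costs $O(np^2)$ operations. The storage convention `band[k][i] = M[i][i+k]` and the function name are assumptions of this sketch.

```python
import math

def solve_banded_spd(band, rhs):
    """Solve M x = rhs for symmetric positive definite M with half-bandwidth p.
    band[k][i] holds M[i][i+k] for k = 0..p; entries past the end are ignored."""
    n, p = len(rhs), len(band) - 1
    # Banded Cholesky M = L L'; low[i][k] stores L[i][i-k].
    low = [[0.0] * (p + 1) for _ in range(n)]
    for i in range(n):
        for j in range(max(0, i - p), i + 1):
            s = band[i - j][j]  # M[i][j], read from the upper band by symmetry
            for c in range(max(0, i - p), j):
                s -= low[i][i - c] * low[j][j - c]
            low[i][i - j] = math.sqrt(s) if j == i else s / low[j][0]
    # Forward substitution: L z = rhs.
    z = [0.0] * n
    for i in range(n):
        s = rhs[i] - sum(low[i][i - c] * z[c] for c in range(max(0, i - p), i))
        z[i] = s / low[i][0]
    # Back substitution: L' x = z.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = z[i] - sum(low[c][c - i] * x[c] for c in range(i + 1, min(n, i + p + 1)))
        x[i] = s / low[i][0]
    return x
```

For the 5-band matrix of Section 3 one would pass three lists (the diagonal and the first two superdiagonals), so p = 2 and the work grows linearly in n.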

Remark. For the case of the unequally spaced data points problem, $\tilde M$ does not have the regular form of (3.4). Although we still do not need the whole $\tilde M^{-1}$ matrix, we do not have a linear time algorithm to compute the required entries. However $\tilde M^{-1}$ can be computed in $O(n^2)$ operations by the band structure of $\tilde M$. Also, $\tilde T = \tilde\Delta T$ is still sparse. Therefore, for unequally spaced data points problems, we have a quadratic algorithm which is still more efficient than the existing cubic time algorithm in the partial spline setup.


4.  EFFICIENT ALGORITHMS FOR ORDINARY
SMOOTHING SPLINES


As a byproduct of developing the linear time algorithm described in Section 3, a linear time algorithm for ordinary smoothing splines for equally spaced data is available. To describe the algorithm, we first note that $\Delta T = 0$. Since $T'c = 0$, we can express c as $\Delta'\gamma$, for some (n-m)-vector $\gamma$. Then the system of equations (1.8) can be rewritten as $W\gamma = \Delta y$ with $W = \Delta\Sigma\Delta' + n\lambda\,\Delta\Delta'$, which again is a symmetric (2m+1)-band matrix. Thus $\gamma$ can be solved in linear time and then the solution of (1.8) is

$$c = \Delta' W^{-1}\Delta y = \Delta'\gamma, \qquad \beta = (T'T)^{-1}T'(y - Mc).$$

Also we have

$$I - A(\lambda) = n\lambda\,\Delta' W^{-1}\Delta$$

and

$$V(\lambda) = \frac{n\,\|c\|^2}{\bigl(\mathrm{tr}(\Delta' W^{-1}\Delta)\bigr)^2}.$$
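The reduction to the banded system $W\gamma = \Delta y$ can be sketched for m = 2 as follows. This is an illustrative dense implementation in exact arithmetic, not the linear-time algorithm itself: it assumes the closed form $s^2 t/2 - s^3/6$ ($s \le t$) for the covariance entries, uses the identity $g_\lambda = y - n\lambda c$ that follows from (1.8), and solves the system by plain Gaussian elimination. All function names are hypothetical.

```python
from fractions import Fraction

def sigma_entry(s, t):
    """Covariance entry for m = 2: integral over [0, 1] of (s-u)_+ (t-u)_+ du."""
    lo, hi = min(s, t), max(s, t)
    return lo * lo * hi / 2 - lo ** 3 / 6

def second_difference_rows(ts):
    """Rows of the (n-2) x n second divided difference matrix Delta."""
    n, rows = len(ts), []
    for i in range(n - 2):
        a, b, c = ts[i], ts[i + 1], ts[i + 2]
        row = [Fraction(0)] * n
        row[i] = 1 / ((a - b) * (a - c))
        row[i + 1] = 1 / ((b - a) * (b - c))
        row[i + 2] = 1 / ((c - a) * (c - b))
        rows.append(row)
    return rows

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def solve(a, b):
    """Exact Gaussian elimination for a small nonsingular system."""
    n = len(b)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        piv = next(r for r in range(col, n) if m[r][col] != 0)
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    x = [Fraction(0)] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def smoothing_spline_fit(ts, ys, lam):
    """Fitted values g = y - n*lam*c with c = Delta' gamma and W gamma = Delta y."""
    n = len(ts)
    D = second_difference_rows(ts)
    Dt = [list(col) for col in zip(*D)]
    S = [[sigma_entry(s, t) for t in ts] for s in ts]
    DSDt, DDt = matmul(matmul(D, S), Dt), matmul(D, Dt)
    W = [[w + n * lam * d for w, d in zip(rw, rd)] for rw, rd in zip(DSDt, DDt)]
    gamma = solve(W, [sum(d * y for d, y in zip(row, ys)) for row in D])
    c = [sum(Dt[i][k] * gamma[k] for k in range(n - 2)) for i in range(n)]
    return [y - n * lam * ci for y, ci in zip(ys, c)], W
```

Two consequences are easy to see from this form: data lying exactly on a straight line are reproduced (then $\Delta y = 0$ and $\gamma = 0$), and the residuals sum to zero because $\Delta$ annihilates constants. A production version would replace the dense products and `solve` by banded operations.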


The pattern of W is even more regular than the $\tilde M$ matrix described in Section 3. In fact, W has the same pattern as the $B^k$ matrix described in the previous section. With the corresponding replacements in Procedure 1, we can obtain a procedure for computing any entry of the inverse of W in constant time once a linear time overhead for recursively computing three sequences of determinants is done.

Procedure 2. (Computing any entry of $(B^n)^{-1}$)

Step 1. Recursively compute $\{\beta_k,\ k = 1, 2, \ldots, n\}$, $\{\tilde\beta_k,\ k = 1, 2, \ldots, n-1\}$ and $\{\gamma_k,\ k = 1, 2, \ldots, n-1\}$ as in Procedure 1.

Step 2. To compute the (j,i)-th entry of $(B^n)^{-1}$ for $j \ge i$, we first compute the cofactor of the element $B^n_{ij}$, Cof $B^n_{ij}$, as


cofB",  =  A  ,  a 


i-l  ^n-i 


c^V2  ^n-i-1  ■  for  1  ^  ^  "■ 


for  i  <  i  <  j  <  n, 

where  4  •  =  +  oS-34h-I- 


for  1  <  j  <  k  <  n. 


Then $(B^n)^{-1}_{ji} = (\text{Cof } B^n_{ij})/\beta_n$.


Remark. In the unequally spaced knots case, W is banded but is not constant on the diagonal like $B^n$. However, we only need the central (2m+1) bands of $W^{-1}$ to compute the trace of $I - A(\lambda)$. Hutchinson and de Hoog (1985) have a procedure to calculate these bands in $O(n)$ operations. Thus a linear time algorithm is also available.


APPENDIX 


In  this  appendix  we  give  proofs  of  [Propositions  2.1 
and  2.2. 

Denote $v = (v_1, v_2, \ldots, v_n)^T$ and the $i$-th row of $\Delta$ by $\Delta_{i\cdot}$.

Lemma A.1. $\Delta_{i\cdot} v = 0$, provided that $v_i, v_{i+1}, \ldots, v_{i+m}$
can be interpolated by a polynomial of degree less
than or equal to $m-1$.

Proof: Let $v_i, v_{i+1}, \ldots, v_{i+m}$ be interpolated by a
polynomial $p$ of degree less than or equal to $m-1$. Thus
we have $p(t_j) = v_j$ for $j = i, \ldots, i+m$. It is well known
that $\Delta_{i\cdot}\,\mathbf{p} = 0$, where $\mathbf{p} = (p(t_1), p(t_2), \ldots, p(t_n))^T$.
Since $\Delta_{ij} = 0$ for $j < i$ or $j > i+m$, we have $\Delta_{i\cdot} v = 0$.

Proof of Proposition 2.1. Since the symmetry is
obvious, it suffices to show that the $(i,j)$-th entry of
$\Delta\Sigma\Delta^T$ is 0 for all $j > i+m$. Note that

$$(\Delta\Sigma\Delta^T)_{ij} = \sum_{k=1}^{n}\sum_{s=1}^{n} \Delta_{ik}\,\Sigma_{ks}\,\Delta_{js}
= \frac{1}{((m-1)!)^2} \int_0^1 \Big[\sum_{k=i}^{i+m} \Delta_{ik}\,(t_k-u)_+^{m-1}\Big]\Big[\sum_{s=j}^{j+m} \Delta_{js}\,(t_s-u)_+^{m-1}\Big]\,du.$$

If $j > i+m$, then since the $t_s$'s in the second sum are no
smaller than the $t_k$'s in the first sum, the integral can
be rewritten as

$$\int_0^1 \Big[\sum_{k=i}^{i+m} \Delta_{ik}\,(t_k-u)_+^{m-1}\Big]\Big[\sum_{s=j}^{j+m} \Delta_{js}\,(t_s-u)^{m-1}\Big]\,du.$$

Then by Lemma A.1, the second sum in the integrand is
0.

Proof  of  Proposition  2.2.  Let  U  be  the  m  by  n  matrix 


formed  by  the  top  m  rows  of  A^.  .Note  that  Ibj  =  0  for 
i  <  j  <  n,  i  =  1,  2,...,m.  Then 

SSAf  =  [ 


..t,  .t 


Since  AEA^  is  a  symmetric  2m-l  band  matrix,  it 
suffices  to  show  that  (USA^)jj  =  0,  for  i  <  j  <  n— m, 
i  =  l,2,...,m.  By  the  same  argument  as  in  the  prtxtf  of 
Proposition  2.1,  we  have  (U^A^)jj  equals  to 

!/((m-l)!)~  limes 
rl  i 

U:,.  ][  5:  A:„  (L-u)‘"~Mdu, 


0  k=l  +  g^j  js  s 


r  i 

1 

Jn  ’ 


again, which is 0 if j > i.


ON THE CONSISTENCY OF A REGRESSION FUNCTION
WITH LOCAL BANDWIDTH SELECTION

Ting Yang, University of Cincinnati


This paper studies kernel estimators of an
unknown regression function with data-based local
bandwidth (LB) selection. Under weak conditions, we
discuss the uniform strong convergence and the
convergence rate of kernel estimators with a local
bandwidth (or an automatic local bandwidth).

1. Introduction 

Let $X_1, X_2, \ldots, X_n$ be independent and identically
distributed random variables with unknown density function
$f(x)$. We consider the kernel estimator $f_n(x)$ of the
density $f(x)$ defined by the following form
(Rosenblatt-Parzen type)

$$f_n(x) = \frac{1}{n h_n} \sum_{i=1}^{n} K\!\left(\frac{x - X_i}{h_n}\right), \qquad (1.1)$$

where $K(x)$ is a real-valued Borel measurable
function on $\mathbb{R}$ and $h_n$ ($> 0$) is the bandwidth,
which is assumed to satisfy $h_n \to 0$ and $n h_n \to \infty$
as $n \to \infty$. If $K(x)$ is chosen to be a density
function, i.e.,

$$\int K(x)\,dx = 1 \quad \text{and} \quad K(x) \ge 0, \qquad (1.2)$$

$f_n(x)$ itself will be a probability density
function.

Assume that $(X, Y)$ is a pair of random
variables. If $E(|Y|) < \infty$, there exists a
regression function given by

$$m(x) = E(Y \mid X = x) = \frac{\int y\, g(x, y)\,dy}{f(x)} = \frac{r(x)}{f(x)}, \qquad (1.3)$$

where $f(x)$ is the marginal density of $X$,
$r(x) = \int y\, g(x, y)\,dy$, and $g(x, y)$ is the joint
density of $(X, Y)$. Let $(X_1, Y_1), (X_2, Y_2), \ldots$
be independent random observations with the
same distribution as $(X, Y)$. The kernel
estimate $M_n(x, h_n)$ of $m(x)$ defined by

$$M_n(x, h_n) = \frac{1}{n h_n} \sum_{i=1}^{n} K\!\left(\frac{x - X_i}{h_n}\right) Y_i \Big/ f_n(x, h_n) \qquad (1.4)$$

is determined by a sample $(X_1, Y_1), \ldots, (X_n, Y_n)$
of independent observations from the
population, by a kernel function $K(x)$ and a
bandwidth $h_n$. In (1.4), $h_n$ is a sequence of
bandwidths with $h_n \to 0$ and $n h_n \to \infty$ as
$n \to \infty$, and $f_n(x, h_n)$ is an estimate of the
marginal density $f(x)$ of $X$. See Watson
(1964), Nadaraya (1964) for the original
definition, and Hardle and Marron (1985) for
recent developments.

When sampling independently, uniform
consistency results such as

$$\sup_{x \in \mathbb{R}} |f_n(x, h_n) - f(x)| \to 0 \ \text{a.s.} \quad \text{or} \quad \sup_{x \in \mathbb{R}} |M_n(x, h_n) - m(x)| \to 0 \ \text{a.s.} \qquad (1.5)$$

were obtained under certain restrictions
imposed on $K$ and $f$, and under the
restriction

$$\frac{n h_n}{\log n} \to \infty \quad \text{as } n \to \infty.$$

These results can be found in papers by
Deheuvels (1974), Silverman (1978), Collomb
(1979), Devroye (1981), and Hardle and
Marron (1985).
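
The estimators (1.1) and (1.4) can be sketched in a few lines of code. This is a minimal sketch under stated assumptions, not the author's implementation: the function names are hypothetical, and the Epanechnikov kernel is used only as one convenient density kernel satisfying (1.2).

```python
def epanechnikov(u):
    """Epanechnikov kernel: a density supported on [-1, 1]."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0


def density_estimate(x, xs, h, kernel=epanechnikov):
    """Rosenblatt-Parzen estimate f_n(x) of (1.1) with global bandwidth h."""
    return sum(kernel((x - xi) / h) for xi in xs) / (len(xs) * h)


def regression_estimate(x, xs, ys, h, kernel=epanechnikov):
    """Kernel regression estimate M_n(x, h) of (1.4).

    Returns NaN where the density estimate in the denominator is zero.
    """
    numerator = sum(kernel((x - xi) / h) * yi
                    for xi, yi in zip(xs, ys)) / (len(xs) * h)
    denominator = density_estimate(x, xs, h, kernel)
    if denominator == 0.0:
        return float("nan")
    return numerator / denominator


if __name__ == "__main__":
    xs = [0.0, 1.0, 2.0, 3.0, 4.0]
    ys = [2.0 * x + 1.0 for x in xs]
    print(regression_estimate(2.0, xs, ys, h=1.5))
```

A small h uses only a few nearby observations (high variance), while a large h averages over many (high bias), which is the trade-off that bandwidth selection addresses.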

In practice, the choice of the bandwidth $h$
is one of the crucial points in applying $M_n(x, h)$.
The estimator (1.4) exhibits a large
variance if $h$ is chosen small, but it has large
bias if a large $h$ is used. For this situation,
methods of global selection of $h$ were studied
by Hardle and Kelly (1987).

However, when data are quite nonlinear,
heteroscedastic, and nonhomogeneous, using a
global bandwidth $h$ may not be efficient. This
situation motivates the study of the kernel
estimators with data-based locally varying
bandwidth. The corresponding kernel
estimator is

$$m_n(x, h_n(x)) = r_n(x, h_n(x)) / f_n(x, h_n(x)), \qquad (1.6)$$

where

$$r_n(x, h_n(x)) = \frac{1}{n h_n(x)} \sum_{i=1}^{n} K\!\left(\frac{x - X_i}{h_n(x)}\right) Y_i, \qquad (1.7)$$

$$f_n(x, h_n(x)) = \frac{1}{n h_n(x)} \sum_{i=1}^{n} K\!\left(\frac{x - X_i}{h_n(x)}\right), \qquad (1.8)$$

and $h_n(x)$ denotes that the bandwidth is a
function of $x$. If the density $f(x)$ is known,
the estimate of $m(x)$ is simplified as

$$m_n(x, h_n(x)) = r_n(x, h_n(x)) / f(x). \qquad (1.9)$$

For a non-random variable X, this type of
estimator was studied by Muller and
Stadtmuller (1987).

In the following, the notion of optimality
always refers to minimization of the mean
squared error (MSE) of the estimate for local
bandwidth (LB) selection, or of the integrated
mean squared error (IMSE) of the estimate for
global bandwidth (GB) selection.

In Section 2, we present several results
about the uniformly strong consistency of the
kernel estimators with LB and we also point
out the rate of this convergence. Note that
the conditions we use are weaker than those in
Hardle and Marron (1985) and Mack and
Muller (1987).

2. UNIFORMLY STRONG CONSISTENCY

In the following sections, we make several
restrictions on the kernel and the joint
probability density of $(X, Y)$:

(C.1) $K$ is bounded, continuous, symmetric,
and has finite total variation.

(C.2) Assume that $K(x) \in \mathcal{M}_{0,k}$ (the definition of
$\mathcal{M}_{0,k}$ is in Muller and Stadtmuller
(1987)) for $k \ge 2$ and $K^2(x)$ is integrable.

(C.3) The probability density $g(x, y)$ of $(X, Y)$
has up to $(k+1)$st partial derivatives w.r.t.
$x$. Define

$$G_j(x) = \int y\, g_x^{(j)}(x, y)\,dy.$$

Assume $G_j(x)$, $j = 0, 1, \ldots, k+1$, exists for all $x$,
and $G_k^2(x)$ is integrable and bounded.

(C.4) $Y$ is a.s. bounded, i.e., there is a
constant $C$ such that

$$|Y| \le C \quad \text{a.s.}$$

We can show that the larger $k$ is in the
condition (C.2), the higher is the convergence
rate of the estimator $m_n(x, h_n(x))$ with
optimal LB w.r.t. MSE($m_n$). For example,
Epanechnikov's (1969) kernel

$$K(u) = \tfrac{3}{4}(1 - u^2), \quad |u| \le 1,$$

is in $\mathcal{M}_{0,2}$. The kernel


1--U 


is in $\mathcal{M}_{0,4}$.
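
The locally varying estimator of (1.6)-(1.8) differs from the global one only in that the bandwidth is evaluated at the point of estimation. The following sketch illustrates this under stated assumptions: `bandwidth_fn` is a hypothetical user-supplied function standing in for $h_n(x)$, and how it is chosen from the data is not shown.

```python
def epanechnikov(u):
    """Epanechnikov kernel, 0.75 * (1 - u^2) on [-1, 1] and 0 elsewhere."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0


def local_regression_estimate(x, xs, ys, bandwidth_fn, kernel=epanechnikov):
    """Estimate m(x) as r_n / f_n with a bandwidth h = bandwidth_fn(x).

    bandwidth_fn is an illustrative stand-in for the local bandwidth h_n(x).
    """
    h = bandwidth_fn(x)
    n = len(xs)
    weights = [kernel((x - xi) / h) for xi in xs]
    r_n = sum(w * yi for w, yi in zip(weights, ys)) / (n * h)  # as in (1.7)
    f_n = sum(weights) / (n * h)                               # as in (1.8)
    if f_n == 0.0:
        return float("nan")
    return r_n / f_n                                           # as in (1.6)


if __name__ == "__main__":
    xs = [0.0, 1.0, 2.0, 3.0, 4.0]
    ys = [1.0, 3.0, 5.0, 7.0, 9.0]
    # Narrow window on the left, wide window on the right (arbitrary choice).
    h_of_x = lambda x: 0.5 if x < 2.0 else 1.5
    print(local_regression_estimate(1.0, xs, ys, h_of_x))
    print(local_regression_estimate(3.0, xs, ys, h_of_x))
```

With a narrow window the estimate at a point follows the nearest observations closely; with a wide window it averages over more of them.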


In this section, we discuss the convergence
of regression estimation with LB. We define
$h_n = h_n(\tau_x) = \tau_x\, n^{-1/(2k+1)}$, where $\tau_x$ is a function
of $x$ and $0 < a \le \tau_x \le b < \infty$. Both $a$ and $b$ are
constants here. First we consider the case in which $\tau_x$ is
non-random. But because $\tau_x$ will be
estimated from the data, we next study the
properties of the estimators in the case that $\tau_x$ is
chosen as a random variable. To simplify our
local bandwidth considerations, we assume
that $f(x) \ge \varepsilon > 0$ on a finite interval $I = [-M, M]$,
$M > 0$, and that the value $\tau_x$ is always contained in
$[a, b]$. From now on we will write $m_n(\tau_x)$, $r_n(\tau_x)$, and
$f_n(\tau_x)$ in place of $m_n(x, h_n(\tau_x))$, $r_n(x, h_n(\tau_x))$, and
$f_n(x, h_n(\tau_x))$, respectively, to relieve the burden
of notation.

Central to our study is the error process
$m_n(\tau_x) - m(x)$, for $a \le \tau_x \le b$.

It can be rewritten as

$$m_n(\tau_x) - m(x) = \frac{1}{f(x)}\,[r_n(\tau_x) - r(x)] - \frac{m(x)}{f(x)}\,[f_n(\tau_x) - f(x)] - [m_n(\tau_x) - m(x)]\,\frac{f_n(\tau_x) - f(x)}{f(x)}. \qquad (2.1)$$

In order to facilitate our main discussions in
this section, we state the following result,
which is established in Yang (1988). Define

$$\delta(n) = n^{1/(2k+1)} (\log\log n / n)^{1/2}. \qquad (2.2)$$

LEMMA 2.1. (Uniform strong convergence
of LB density estimators). Suppose that the
property in (C.3) is true for the density $f(x)$
instead of for $g(x, y)$. Conditions (C.1) and
(C.2) hold in both (i) and (ii):

(i) If $\tau_x$ is non-random, then

$$\sup_x |f_n(\tau_x) - f(x)| = O[\delta(n)] \quad \text{a.s.} \qquad (2.3)$$

(ii) If $\hat\tau_x$ is random, then

$$\sup_x |f_n(\hat\tau_x) - f(x)| = O[\delta(n)] \quad \text{a.s.} \qquad (2.4)$$

Let $\tau$ be in $[a, b]$. According to Lemma 2.1,
(2.1) can be simplified as

$$m_n(\tau) - m(x) + o(|m_n(\tau) - m(x)|) = \frac{1}{f(x)}\,[r_n(\tau) - r(x)] - \frac{m(x)}{f(x)}\,[f_n(\tau) - f(x)]. \qquad (2.5)$$

In the case of $f(x)$ unknown, we replace
$f(x)$ by its estimator $f_n$ in (1.9). In other
words, we have to consider the problem of the
estimator $m_n$ in (1.6), which has a random
denominator. From a mathematically deductive
point of view, this is a difficult feature in our
study. Applying (2.5), the problem
of convergence of $m_n$ is reduced to the
problem of convergence of $r_n$ and $f_n$. The
problem of convergence of $f_n$ is settled in
Lemma 2.1.

Now we discuss the convergence of $r_n$. We
state the following facts, which are either
established by traditional techniques or found in the
literature. (For a good reference, see Prakasa
Rao (1983, p. 33-48)). We write

$$r_n(\tau) - r(x) = \alpha_n(x) + \beta_n(x),$$

where $\alpha_n(x) = E\,r_n(\tau) - r(x)$ is non-random, and
$\beta_n(x) = r_n(\tau) - E\,r_n(\tau)$ is random.

Let $F(x, y)$ be the distribution function of $(X, Y)$,
and $F_n(x, y)$ its empirical distribution
function based on an i.i.d. sample $(X_1, Y_1), \ldots,
(X_n, Y_n)$, i.e.,

$$F_n(x, y) = \frac{1}{n} \sum_{i=1}^{n} I_{(-\infty, x]}(X_i)\, I_{(-\infty, y]}(Y_i),$$

where $I$ is an indicator function. We rewrite


and


According to (C.1)-(C.4) and Taylor's
expansion, we easily derive that

$$\sup_x \sup_{\tau \in [a,b]} |\alpha_n(x)| \le C_1\, n^{-k/(2k+1)}\, b^k \qquad (2.6)$$

and

$$\sup_x \sup_{\tau \in [a,b]} |\beta_n(x)| \le \frac{C_2}{h_n(a)} \sup_{x,y} |F_n(x, y) - F(x, y)|, \qquad (2.7)$$

where $C_1$ is a constant, $C_2 = \rho \sup|Y|$, and $\rho$ is
the total variation of $K$. From the result of
Kiefer (1961) for $F$ continuous,

$$P\Big[\limsup_{n \to \infty}\, n^{1/2} \sup_{x,y} |F_n(x, y) - F(x, y)| \,/\, (\log\log n / 2)^{1/2} = 1\Big] = 1. \qquad (2.8)$$

From (2.8), we obtain that there is a constant
$C_3$ such that

$$\sup_{x,y} |F_n(x, y) - F(x, y)| \le C_3 \left(\frac{\log\log n}{n}\right)^{1/2} \quad \text{a.s.} \qquad (2.9)$$


Relations (2.7) and (2.9) prove that

$$\sup_x \sup_{\tau \in [a,b]} |\beta_n(x)| \le \frac{C_2 C_3}{h_n(a)} \left(\frac{\log\log n}{n}\right)^{1/2} \quad \text{a.s.} \qquad (2.10)$$

Comparing (2.6) and (2.10), we get the
following result

$$\sup_x \sup_{\tau \in [a,b]} |r_n(\tau) - r(x)| \le C\,\delta(n) \quad \text{a.s.}, \qquad (2.11)$$

where $\delta(n)$ is in (2.2). Therefore, we have
proved

THEOREM 2.1. Assume that the conditions
(C.1)-(C.4) hold, then

$$\sup_x |r_n(\tau_x) - r(x)| = O(\delta(n)) \quad \text{a.s.}$$

In the situation of $f(x)$ known, under our
conditions, it is easily implied that the
convergence of $m_n(\tau_x)$ is the same as that of $r_n(\tau_x)$.

We know that the optimal LB choice requires
knowledge at the point $x$ of unknown
functions, and is thus not available in practice
(see Yang (1988)). Some people study whether
one can use a pilot estimate $\hat\tau_x$ of $\tau_x$ to form a
data-driven bandwidth sequence $h(\hat\tau_x)$ in such a
way that $m_n(\hat\tau_x)$ is as efficient as $m_n(\tau_x)$. For the
kernel estimation case, Krieger and Pickands
(1981) and Abramson (1982) answered in the
positive. Mack and Muller (1987) proved
similar results for the kernel regression case.
Their methods of attack involved tightness and
weak convergence of some error process. In
this article, we will present strong consistency
results on the more general data-driven LB
estimator $m_n(\hat\tau_x)$ under simpler conditions. We
state

COROLLARY 2.1. Suppose that conditions
(C.1)-(C.4) hold. Then for any $\tau_1$ and $\tau_2$
contained in $[a, b]$,

$$r_n(\tau_1) - r_n(\tau_2) = O(\delta(n)) \quad \text{a.s.}$$

Proof. According to (2.11), we have

$$\sup_x \sup_{\tau_1, \tau_2 \in [a,b]} |r_n(\tau_1) - r_n(\tau_2)| \le 2C\,\delta(n) \quad \text{a.s.}$$

COROLLARY 2.2. Assume that $\hat\tau_x$ is a random
variable and $\hat\tau_x \in [a, b]$ a.s., and that $\tau_x$ is a function of $x$
with $\tau_x \in [a, b]$. Suppose (C.1)-(C.4) hold, then

$$\sup_x |r_n(\hat\tau_x) - r_n(\tau_x)| = O(\delta(n)) \quad \text{a.s.}$$

Proof. By applying the following inequality

$$|r_n(\tau_1) - r_n(\tau_2)| \le \sup_{\tau_1, \tau_2 \in [a,b]} |r_n(\tau_1) - r_n(\tau_2)|$$

for any $\tau_1$ and $\tau_2 \in [a, b]$ and for any $x$, and Corollary 2.1, this
lemma is done immediately. ∎

Under the conditions of Corollary 2.2, we
have the following important fact.

THEOREM 2.2.

$$\sup_x |r_n(\hat\tau_x) - r(x)| = O(\delta(n)) \quad \text{a.s.}$$

Proof. For any fixed $x$ and $\hat\tau_x \in [a, b]$, we
have

$$|r_n(\hat\tau_x) - r(x)| \le \sup_{\tau \in [a,b]} |r_n(\tau) - r(x)|.$$

Hence, we obtain the following inequality

$$\sup_x |r_n(\hat\tau_x) - r(x)| \le \sup_x \sup_{\tau \in [a,b]} |r_n(\tau) - r(x)|.$$

According to (2.11), the proof is done. ∎

Recalling relation (2.5) and applying Lemma
2.1, Theorems 2.1 and 2.2, we complete the
proof of the uniformly strong convergence of the
regression estimator $m_n$ for the case of $f(x)$
unknown. The results are stated as follows.

THEOREM 2.3. (Uniform strong
consistency of LB regression estimators).
Suppose that conditions (C.1)-(C.4) hold in
both (i) and (ii):

(i) If $\tau_x$ is non-random and contained in $[a, b]$, then

$$\sup_{x \in I} |m_n(\tau_x) - m(x)| = O[\delta(n)] \quad \text{a.s.}$$

(ii) If $\hat\tau_x$ is random and contained in $[a, b]$,
then

$$\sup_{x \in I} |m_n(\hat\tau_x) - m(x)| = O[\delta(n)] \quad \text{a.s.},$$

where $I$ is a finite interval.


Abstract

This article provides fairly comprehensive
information about the existing software for Bayesian data
analysis. An earlier version of this article is published in
Goel (1988). Even though new software is being developed
at a reasonable pace, the Bayesian software available for
widespread use is still in its infancy. Thus the goal of a
general purpose Bayesian Statistical Analysis Package is
still a long way off. Two avenues for quickly reaching this
goal are discussed in the concluding section.


1.  Introduction. 

In  May  1986,  a  workshop  on  Bayesian  computing, 
to  discuss  various  issues  in  an  open  forum,  was  organized 
at  The  Ohio  State  University.  The  two  main  issues 
discussed  were  (1)  desirable  computing  environments  for 
Bayesian  statistical  analysis,  and  (2)  potentials  for  a 
Bayesian  Analysis  package.  Although  almost  all  the 
participants  believed  that  wide-spread  use  of  Bayesian 
methodology  will  not  become  a  reality  without  an 
interactive  Bayesian  statistical  analysis  package,  most 
agreed  that  it  is  too  early  to  push  for  one.  Diverse  points 
of  view  also  existed  about  the  environment  suitable  for  a 
future  package. 

However, it was suggested that future development
of a package will become easy if new Bayesian software is
compatible with an existing statistical package with
excellent data handling and graphics capabilities, e.g., S®.
Development of a Bayesian 'Bulletin Board' and 'Software
Database', accessible via networks for news and file
transfers, was also suggested. These tasks have not been
initiated as of now. Hopefully, such an initiative may be
taken in the Fall of '88.

The information compiled in this article was
provided by the individuals listed in parentheses
after the program name. We did not have access to any
mailing  list  for  the  engineers  involved  in  risk  assessment 
and  reliability,  who  have  developed  several  special  purpose 
Bayesian  analysis  programs  which  could  be  adapted  for 
general  reliability  applications.  Thus  the  listing  of 
reliability  programs  is  rather  incomplete.  On  comparing 
similar  listings  in  Press(198''),  it  is  clear  that  impressive 
gains have been made in the development of software for
implementing the Bayesian paradigm, based on realistic
specifications of prior information, via approximations,
numerical analysis, and Monte Carlo integration
techniques. On the other hand, it is also clear that only a
few people have devoted their energy to developing
Bayesian analysis software.

The  available  software  is  listed  according  to  the 
following  categories:  general  purpose  data  monitor 
(Section  2);  Regression,  Time  Series  &  Econometric 
modeling  (Section  3);  Computation/Approximation  of 
posterior  distribution  features  (Section  4);  Elicitation  of 
prior  information  (Section  5);  Reliability  Analysis 
(Section  6)  and  Miscellaneous  (Section  7).  Our  views  on 
developing  a  general  purpose  Bayesian  Analysis  Package 
are  given  in  Section  8. 


2.  General  purpose  data  analysis 

Program  Name:  CADA  [Computer  Assisted  Data 
Analysis  Monitor,  1983  (CADA  Group)] 

Function:  CADA,  a  conversational  language  for  Bayesian 
analysis,  is  a  hierarchically  structured  system  with  several 
component  groups. 

Input:  On-line  raw  data  entry  or  data  files  to  be  loaded. 
Output:  Analysis  for  beta,  two-parameter  normal,  and 
multinomial  models  based  on  conjugate  priors; 
assessment  of  conjugate  priors  and  utility  functions;  full 
rank  Model  I  ANOVA  and  MANOVA  for  multifactor 
designs  using  conjugate  or  noninformative  priors; 
simultaneous  estimation  of  regression  in  m-groups; 
psychometric  methods;  EDA;  probability  distribution  and 
actuarial  functions. 

Language:  BASIC  Compiler  or  interpreter  required. 
Machines:  DEC-PDP-11(RSTS);  DEC-VAX-ll(VMS), 
PRIME,  HP-3000.  IBM  PC  version  to  be  released  soon. 
Documentation: Novick, M.L. et al. (1983). Manual for
the Computer-Assisted Data Analysis (CADA) Monitor,
Iowa City, IA: CADA Group, Inc.

Availability: Available for $600 per copy from The
CADA Group, Inc., 306 Mullin Ave., Iowa City, IA
52240, Tel. (319) 351-7200

Program  Name:  BAYES  PAK(Barlow) 

Function:  A  menu  driven  collection  of  programs  for 
teaching  simple  Bayesian  analysis  concepts.  It  provides 
plotting  capability  for  data  and  for  various  densities  as 
well  as  analysis  and  simulation.  It  is  used  at  UC 
Berkeley  for  an  engineering  statistics  course. 

Input:  Menu  driven  interactive  environment  prompts  for 
input  parameters. 

Output:  The  program  provides  plotting  capability  for 
densities  involved  in  the  conjugate  Bayesian  analysis  of 
Binomial  and  Normal  data.  It  can  also  plot  two  densities 
for  different  parameter  specifications  simultaneously. 


Some  simulation  capability  using  Uniform  and  White 
noise  random  variables  is  also  available. 

Language:  BASIC 

Machines: IBM PC-AT or compatibles, IBM EGA or
CGA graphics card

Documentation: Barlow, R.E., BAYES PAK, User's
Manual, Berkeley, CA: University of California
Availability: Diskette available from Prof. Richard E.
Barlow, Department of I.E. & O.R., University of
California, Berkeley, CA 94720
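
The conjugate Bayesian analysis of Binomial data that BAYES PAK is described as plotting can be sketched as follows. This is an illustration of the standard Beta-Binomial update only; it is not BAYES PAK code, and the function names are hypothetical.

```python
import math


def beta_binomial_update(a, b, successes, trials):
    """Posterior Beta(a', b') parameters after observing Binomial data
    under a Beta(a, b) prior on the success probability."""
    return a + successes, b + (trials - successes)


def beta_pdf(p, a, b):
    """Beta(a, b) density at p, for 0 < p < 1, computed via log-gamma."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1.0) * math.log(p)
                    + (b - 1.0) * math.log(1.0 - p))


def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)


if __name__ == "__main__":
    # Uniform prior, then 7 successes in 10 trials (made-up numbers).
    a_post, b_post = beta_binomial_update(1.0, 1.0, 7, 10)
    print(a_post, b_post, beta_mean(a_post, b_post))
    # Values of the prior and posterior densities on a grid, ready to plot.
    grid = [i / 20.0 for i in range(1, 20)]
    print([round(beta_pdf(p, a_post, b_post), 3) for p in grid])
```

Evaluating `beta_pdf` on a grid for two different parameter pairs gives the kind of side-by-side density comparison mentioned in the program description.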


3.  Normal  Linear  Regression,  Time  Series  & 
Econometric  models. 

Program  Name:  BATS  [Bayesian  Analysis  of  Time 
Series,  Release  1.1,  June  1987(West)] 

Function: This software package provides a completely
menu driven collection of functions that can be used for a
variety of activities in data management, analysis and
graphical displays. The Bayesian approach to time series
modeling and forecasting is based on a wide class of
dynamic linear, and non-linear, models suitable for many
types of time series data arising in industrial, economic
and scientific investigations. The program allows data
transformations and dynamic model definition, specifying
components for smooth trends, described by polynomial
functions over time, regression effects of independent
variables, additive or multiplicative seasonal components
and error terms, as well as interactive specification of prior
distributions on model components.

Input:  Menu  Driven  interactive  environment  prompts  for 
input  parameters.  No  knowledge  of  APL  is  necessary. 
Output:  Interactive  mode  for  data  description  and 
summaries  and  displays  in  numerical  and  graphical  forms; 
sequential  model  estimation;  numerical  and  graphical 
displays  of  features  of  fitted  model,  smoothed  estimates  of 
components  including  trend,  growth,  seasonal  effects  and 
factors,  regression  effects  and  parameters,  residuals,  and 
error  variances.  In  addition,  retrospective  fit  of  time  series 
and  step-ahead  forecasts  are  also  available.  The  numerical 
summaries  and  model  information  can  be  saved  on  disk 
file  or  printed.  Interactive  manipulation  of  graphic 
displays  for  report  production  is  also  possible. 

Language:  APL*PLUS/PC®  Release  6.3  or  later  (user 
must  have  the  interpreter) 

Machines:  IBM  PC,  AT&T  and  compatibles  with  a 
minimum  of  520K  RAM 

Documentation:  West,  M.,  Harrison,  J.  and  Pole, 
A. (1987)  BATS:  A  User  Guide,  Coventry,  England; 
University  of  Warwick 

Availability:  Available  for  private  or  academic  use  for  a 
nominal  charge  of  30  Pounds  Sterling  from  the  Bayesian 
Forecasting  Group,  Department  of  Statistics,  University 
of  Warwick,  Coventry  CV4  7AL,  England. 

Remarks:  Some  of  the  theoretical  developments  and 
applications  are  discussed  in  West,  Harrison  & 
Migon(1985)  and  West  &  Harrison(1986),  Harrison  & 
West(1987). 


Program Name: BRAP [Bayesian Regression Analysis
Program, Ver. 2.0 (Abowd/Zellner)]

Function: Provides a unified package for the Bayesian
analyses of the normal linear multiple regression model
(MRM) with multivariate normal errors under a
noninformative prior, a g-prior or a natural conjugate
prior distribution. Some data transformations are built in
and IMSL® could be used for others.

Input:  Control  cards  in  JCL  format.  Data  files  loaded 
thru  JCL. 

Output:  Updates  the  prior  parameters;  provides  standard 
posterior  information;  Plots  raw  data  and  residuals, 
marginal  and  bivariate  contours  of  the  prior  and  the 
posterior  distributions  of  the  regression  coefficients, 
posterior distribution of the realized errors, posterior
distribution  of  linear  functions  of  coefficients;  quantiles 
of  posterior  distribution  for  nonstandard  models  can  be 
obtained  via  numerical  integration  and  Monte-Carlo 
routines. 

Language:  FORTRAN-IV 

Machine:  IBM-MVS  (may  need  some  modifications  for 
recent  IBM  compilers) 

Documentation: Abowd, J.M., Moulton, B. R. and
Zellner, A. (1985) The Bayesian Regression Analysis
Package. BRAP User's Manual, Version 2.0, H.G.B.
Alexander Research Foundation, Graduate School of
Business, University of Chicago
Availability: Package available from Prof. Arnold
Zellner, University of Chicago, Graduate School of
Business, 1101 East 58th Street, Chicago, IL 60637, at a
very nominal cost.

Remarks: Other contributors to the development of
BRAP include F. Finnegan, S. Grossman, C. Plosser, P.
Rossi, A. Siow, J. Stafford, and W. Vandaele.

Program Name: BRAP-PC [Bayesian Regression
Analysis Package for the IBM PC (de Alba/Rocha)]
Function: This enhancement of BRAP also includes
subroutines for Bayesian disaggregation and constrained
forecasting.

Language: FORTRAN 77
Machines: IBM PC and PC compatibles.

Availability: Available from Prof. Enrique de Alba,
Instituto Technologico Autonomo De Mexico (ITAM),
Rio Hondo, No. 1, Mexico, D.F. 01000, for nominal
mailing and diskette charges.

Program  Name:  SEARCH  [Seeks  Extreme  and  Average 
Regression  Coefficient  Hypothesis  (Leamer/Leonard)] 
Function: A user-oriented package for Bayesian inference
and sensitivity analysis that pools prior beliefs about the
regression coefficients with evidence embodied in a given
data set. Prior beliefs are assumed to be equivalent to a
previous, but possibly fictitious, data set. SEARCH
offers a study of the sensitivity of the posterior estimates
to changes in features of the prior beliefs expressed in
terms of a fictitious data set.

Input: Formatted or free-format card-image files or
on-line CRT input. Input files can be prepared on SAS®,
BMDP®, TSP®, and SPSS®. SEARCH requires a
double precision version of the IMSL® library.

Output: Diagnostic messages for debugging syntax errors
are  available.  Program  reports  summary  of  prior  and  data 
information  received  and  computes  the  approximate 
posterior  mode  for  the  regression  coefficients  when  the 
prior  beliefs  are  modeled  as  having  a  normal 
distribution  with  a  prior  mean  r  and  a  prior  covariance 
matrix  V.  It  also  reports  the  sensitivity  of  the  modal 
estimate  to  changes  in  r  and  V  in  the  form  of  extreme 
bounds  for  any  linear  function  of  the  parameters  specified 
by  the  user. 

Language:  FORTRAN  IV.  The  manual  for  Version  6 
states  that  SEARCH  is  not  completely  in  FORTRAN 
source  code.  Several  of  the  subroutines  for  performing 
high  precision  arithmetic  are  object  code  modules  (written 
in IBM 370 machine code). The bulk of SEARCH is
written in FORTRAN IV that is compiled at UCLA on
the IBM G1 Compiler.

Machine: IBM 370/3033.

Documentation: Leamer, E.E. and Leonard, H. B. (1985)
User's  Manual  for  SEARCH-  A  software  package  for 
Bayesian  inference  and  sensitivity  analysis.  Version  6. 
Availability:  Available  for  Sicio  per  copy  from  Prof.  E. 
E. Leamer, Department of Economics, UCLA, 405
Hilgard  Av.,  Los  Angeles,  CA  90024,  (213)  825-1011, 
on  an  IBM  OS  standard  label  9  track  1600  BPI  tape 
containing  four  card-image  files. 

Remarks: This version, programmed by Arvin Stidick,
differs from Version 5 in efficiency of computation and
economy of input/output. The Manual was largely
rewritten by Thomas E. Wolff. A recent example of how
SEARCH can be used is given in Leamer, E. E. and
Leonard, H.B. (1983) Reporting the Fragility of Regression
Estimates, The Review of Economics and Statistics.
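
The pooling of a normal prior (mean r, covariance V) with the data, and the sensitivity of the resulting estimate to the prior, can be illustrated in the simplest possible setting. The sketch below is not SEARCH's algorithm: it assumes a single regression coefficient, no intercept, and a known error variance, and all names are hypothetical. Under those assumptions the posterior mode is a precision-weighted average of the prior mean and the least-squares estimate.

```python
def least_squares_slope(xs, ys):
    """Least-squares slope for a regression through the origin."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)


def posterior_mode(xs, ys, prior_mean, prior_var, noise_var=1.0):
    """Posterior mode of the slope under a Normal(prior_mean, prior_var)
    prior and Normal errors with known variance noise_var (an assumption)."""
    data_precision = sum(x * x for x in xs) / noise_var
    prior_precision = 1.0 / prior_var
    weighted_sum = (sum(x * y for x, y in zip(xs, ys)) / noise_var
                    + prior_precision * prior_mean)
    return weighted_sum / (data_precision + prior_precision)


if __name__ == "__main__":
    xs = [1.0, 2.0, 3.0]
    ys = [2.1, 3.9, 6.2]  # made-up data
    print("least squares:", least_squares_slope(xs, ys))
    # Sweep the prior variance to see how far the estimate can move.
    for prior_var in (0.001, 0.1, 1.0, 100.0):
        print(prior_var, posterior_mode(xs, ys, prior_mean=0.0,
                                        prior_var=prior_var))
```

In this one-coefficient case, as the prior variance ranges over all positive values the posterior mode moves between the prior mean and the least-squares estimate, which gives a toy version of the extreme bounds that SEARCH is described as reporting.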

Program Name: MICRO EBA [Microcomputer version
of SEARCH (Fowles)]

Function: This main program is the microcomputer
version of the above program SEARCH.
Language: GAUSS

Machine: Any personal computer running the GAUSS
software package, Version 1.46 or higher.

Availability: Available free of charge from Prof. Richard
Fowles, Department of Economics, Rutgers University,
Newark, NJ 07102.

Program  Name:  BRP  [  Bayesian  Regression  Program 
(Bauwens)] 

Function:  This  main  program  performs  Bayesian 
regression  analysis  for  various  standard  econometric 
models,  discussed  in  Dreze(1977).  The  prior  beliefs  are 
modeled  as  Poly-t  densities  evaluated  via  the  program 
PTD. 

Input:  Raw  data  as  card-image  files.  Input  is  echoed  as 
output. 

Output:  Posterior  parameters,  precision  &  standard 
deviations,  and  marginals  of  regression  coefficients; 
classical  regression  analysis,  posterior  residuals  and 
predictive  density  function  of  the  dependent  variable; 


conditional  posterior  with  given  precision,  conditional 
posteriors  of  some  regression  coefficients  given  the 
others,  marginalized  over  the  precision. 

Language:  FORTRAN  77 

Machine:  IBM  370/158  at  the  University  of  Louvain.  In 
near  term,  a  PC  version  is  possible. 

Documentation:  Bauwens,  L.  and  Tompa,  H.  (1977) 
Bayesian  Regression  Program  (BRP),  CORE  User's 
Manual  Set  #  A-5,  and  Tompa,  H.(1977)  Poly-t 
Distributions  (PTD),  CORE  User's  Manual  Set  #  C-9. 
Availability: Available for 5,000 Belgian francs from
Prof. Luc Bauwens, CORE, 34 Voie du Roman Pays,
B-1348 Louvain-La-Neuve, Belgium.

Remarks:  These  programs  have  been  developed  by  H. 
Tompa  under  the  guidance  of  Profs.  Jacques  Dreze  and 
Jean-Francois  Richard  and  with  assistance  from  Luc 
Bauwens,  Jean-Paul  Bulteau  and  Philippe  Gille. 

Program  Name:  BARMA  [Fully  Bayesian  Analysis  of 
ARMA  Time  Series  Models(Monahan)] 

Function:  A  collection  of  main  program  and  subroutines 
carries  out  the  Bayesian  Analysis  for  ARMA  time  series 
models  using  natural  conjugate  priors  as  described  in 
Monahan(1983). 

Output:  Programs  compute  the  posterior  and  predictive 
distributions  of  parameters  for  a  given  set  of  ARMA 
models  using  the  natural  conjugate  prior.  Graphical 
displays  are  obtained  via  SAS/GRAPH. 

Language:  FORTRAN  66 
Machine:  Portable 

Documentation: Monahan, J. (1980) 'A Structured
Bayesian Approach to ARMA time series models,
I, II, III', Technical Reports, Department of Statistics,
North Carolina State University, Raleigh, NC.
Availability: The package is available on tape from Prof.
John Monahan, Department of Statistics, North Carolina
State University, P.O. Box 8203, Raleigh, NC 27695,
at a nominal charge.

Program  Name:  Sampling  the  Future  (Thompson) 
Function:  This  program  simulates  the  predictive 
distribution  of  a  set  of  future  observations  via  Monte 
Carlo  methods  as  discussed  in  Thompson  (1986). 

Output:  The  main  program  and  subroutines  provide  a 
Monte-Carlo  histogram  for  the  predictive  distribution  of  a 
future  observation  or  a  scattergram  of  samples  from  the 
predictive distribution of a pair of future observations.
The  program  allows  as  many  as  10  ARMA  parameters  in 
up  to  3  AR  factors  and  up  to  3  MA  factors.  Thus 
multiplicative  seasonal  factors  and  the  difference  factors 
may  be  used  in  the  model.  Estimation  step  allows  either 
a  diffuse  or  a  conjugate  normal/  gamma  prior  distribution. 
Language:  FORTRAN  77  ANSI  standard. 

Machine: The program runs on any machine with a standard
FORTRAN 77 compiler and the IMSL® library. Future
extensions will require a graphics terminal. The program
will run on a PC with a math co-processor. A PC-AT
type machine with a hard disk is recommended.
Availability: Diskette available for $10 from Prof.
Patrick Thompson, Faculty of Management Sciences, The
Ohio State University, 1775 S. College Road, Columbus,
OH 43210.

Remarks: Future enhancement plans include a graphic
display of predictive distributions and the addition of the
algorithm for prediction from a set of ARMA models given in
Monahan (1983).

Program  Name:  Bayes  &  Empirical  Bayes  Shrinkage 
Estimation  of  Regression  Coefficients  (Nebebe) 

Function: The program computes Bayes and empirical
Bayes estimates for a multiple normal linear regression
model in which the prior for the regression coefficients
and the precision is modeled as a hierarchical normal with
mean μ and a precision parameter. The hyperparameters are
assumed to have various diffuse distributions [see Nebebe,
F. and Stroud, T. W. F. (1986)].

Language: FORTRAN, requires access to the NAG® library.
Documentation: No separate documentation is available.
The details are given in Nebebe, F. (1984) Ph.D. thesis,
Department of Mathematics and Statistics, Queen's
University, Kingston, Canada.

Availability: Available from Prof. F. Nebebe, Dept. of
Decision Sc. and MIS, Concordia University, 1455 De
Maisonneuve Blvd. West, Montreal, Quebec H3G 1M8,
Canada.

Remarks: This program provides no extra capability
beyond BRAP, SEARCH or BAP, but it may be useful
for individuals who do not have access to the IMSL package.

Program  Name:  SHAZAM  [General  Econometrics 
program  (White)] 

Function: The program provides a portable FORTRAN
program for general econometric modeling. The PC version
costs $250 and the mainframe version $500-900. The author
promises that the next version will include a Bayesian
inequality regression.

Availability:  Available  from  Prof.  Kenneth  J.  White, 
Economics  Department,  University  of  British  Columbia, 
Vancouver,  B.C.  Canada. 

Program Name: BTS [Bayesian Time Series
(Carlin/Dempster)]

Function: This program package carries out
computations for Bayesian estimation of unobserved
components ('seasonal'/'nonseasonal') in monthly time
series under a class of Gaussian mixed models as described
in Carlin, Dempster and Jonas (1985). It uses likelihood
based methods for estimation of model parameters.
Output: The program provides posterior estimates of
model parameters. A non-portable version for the Apollo
DN600 workstation has many graphics capabilities.
Language: FORTRAN 77 (Standard ANSI)
Documentation: Description of the program is available
in Carlin, J. B. (1987) Ph.D. Thesis, Department of
Statistics, Harvard University.

Availability:  Available  free  of  charge  from  Prof.  A.P. 
Dempster,  Department  of  Statistics,  Harvard  University, 
Science  Center,  1  Oxford  Street,  Cambridge,  MA  02138. 


Program Name: PROC SEQ [Sequential Scoring
Algorithm (Blattenberger)]

Function:  The  function  performs  iterative  computation  of 
forecasting  distribution  for  the  dependent  variable  of  a 
normal  linear  model  with  a  normal-gamma  prior 
distribution  or  optional  g-priors.  Scores  for  five  different 
scoring  rules  are  also  computed. 

Language:  STAT80  Procedure;  being  converted  to  SAS® 
PROC  MATRIX. 

Availability: Available free of charge from Prof. Gail
Blattenberger, Department of Economics, University of
Utah, Salt Lake City, UT.

Program  Name:  MAXENT  [Data  Analysis  by  Maximum 
Entropy  Principle  Version  1.17  (Jaynes)] 

Function: This beta version of MAXENT provides
fitting of an incompletely specified linear model of the
form Y = XF, where the data vector is Y, the 'smearing
matrix' X is known but not of full rank, and the elements
of the vector F are non-negative and add to 1. The
Maximum Entropy Principle [see Jaynes (1983)] finds the
solution which maximizes the entropy of the probability
distribution of F.

Input: This interactive program requires the input of an
accuracy level for constraint satisfaction.

Output:  The  optimal  solution  is  obtained  iteratively, 
with  access  to  the  output  for  each  iteration. 

Language:  BASIC 

Machines: IBM PC and compatibles. An ASCII source
code file is also on the diskette for transporting the
program to other microcomputers.

Documentation:  Help  file  and  Manual  on  diskette. 
Availability:  Available  free  from  Prof.  Ed  T.  Jaynes, 
Department  of  Physics,  Washington  University,  St. 
Louis,  MO  63130. 

The  programs  briefly  discussed  below  have  been  written 
for  specific  applications  of  linear  models. 

Program Name: RECONDA (Braithwait, Steven)
Function: This C program incorporates engineering prior
estimates of appliance level electricity consumption into a
statistical analysis of household hourly consumption via
a hierarchical linear model. The modeling details are
given in Caves, Herriges, Train, and Windle (1987).

Machines:  IBM  PC  and  PC  compatibles 
Availability:  The  program  will  be  distributed  free  of 
charge  by  EPRI,  P.O.  Box  10412,  Palo  Alto,  CA  94303 
to  EPRI  member  utilities,  government  and  academic 
institutions. 

Program Name: Statistical Cost Allocation (Wright,
Roger)

Function: This FORTRAN 77 program implements the
indirect cost allocation methodology based on a multiple
linear model as described in Wright (1983).
Documentation: The program description and listing are
given in Wright, R. and Oberg, K. (1983) The 1979-80
University of Michigan Heating Plant and Utilities Cost
Allocation Study, Working Paper #352, Graduate School
of Business Administration, The University of Michigan.
Availability: Available free of charge from Prof. Roger
Wright, Graduate School of Business Administration,
The University of Michigan, Ann Arbor, MI 48109.


4. Computation/Approximation of Posterior Distribution
Features.

Program Names: BAYES FOUR & gr (Smith, A.F.M.)
Function: The Bayes Four system consists of a library of
subroutines, primarily intended for numerical
computation of multiple integrals in interactive mode.
Features of the posterior distribution can be evaluated for a
practical implementation of the Bayesian paradigm for up
to 6 parameters using numerical integration procedures
and up to 20 parameters using Monte Carlo integration.
The gr library consists of subroutines for an interactive
color graphics system which can be used to reconstruct
and display output of the Bayes Four system. For
reference, see Smith, Skene, Shaw, Naylor, and
Dransfield (1985).

Input: Solving an inference problem requires writing a
main program that calls the Bayes Four and gr subroutines.
Output: The posterior moments and marginals can be
evaluated by calling these menu driven subroutines. The
gr package can be used to provide graphical displays of
the univariate and bivariate marginal posterior densities
and predictive densities from outputs of Bayes Four.
Language: Bayes Four in FORTRAN 77; gr in 68000
assembler, C and FORTRAN 77.

Machines: BAYES FOUR runs on SUN III or APOLLO
workstations. However, gr has not yet been configured for
any standard graphics system or workstation.
Documentation: Naylor, J. C. and Shaw, J. E. H. (1985)
BAYES FOUR - User Guide; Naylor, J. C. and Shaw, J.
E. H. (1985) BAYES FOUR - Implementation Guide;
Shaw, J. E. H. (1985) gr User Guide. All these are
technical reports from the Nottingham Statistics Group,
Department of Mathematics, University of Nottingham.
Availability: Available from Prof. Adrian Smith,
Department of Mathematics, University of Nottingham,
Nottingham NG7 2RD, U.K. (cost for academic use $200).
Remarks: (i) For application of this system to some
interesting applied problems in the pharmaceutical industry,
see Racine, Grieve, Fluhler, and Smith (1986). (ii) An
enhanced version of BAYES 3.5 is available from Prof.
L.D. Perrichi, Department of Mathematics and Computer
Science, Simon Bolivar University, Apartado 80659,
Caracas 1080A, Venezuela.

Program Name: Simple Importance Sampling
[Computation of Posterior moments and densities via
Monte Carlo Integration (van Dijk)]

Function: This program approximates multiple integrals
that arise in the posterior moments and marginal densities
of parameters of interest in econometric and statistical
modeling, via importance sampling Monte Carlo
integration.

Language: FORTRAN 77

Documentation: The algorithm, program listing and
some examples are given in van Dijk, H. K., Hop, J. P.
and Louter, A. S. (1986) An algorithm for the
computation of posterior moments and densities using
simple importance sampling, Econometric Institute
Report 8625/A, Erasmus University, Rotterdam.
Availability: Available from Prof. Herman K. van Dijk,
Econometric Institute, Erasmus University Rotterdam,
P.O. Box 1738, 3000 DR Rotterdam, The Netherlands.
Remarks: Some standard programs for the method of
mixed integration [see van Dijk, Kloek and
Boender (1985)] are under preparation by Prof. van Dijk.

Program Name: Monte Carlo Integration (Geweke)
Function: A collection of programs using some
interesting methods for constructing an importance sampling
density derived from the asymptotic sampling theoretic
densities of the m.l.e., which are more flexible than the
multivariate Student-t density used in the van Dijk program.
It has built-in diagnostics for the almost sure convergence
of the numerical approximation to the true values.
Some applications of this methodology are also given on
the diskette.

Language: FORTRAN 77 with a double precision
version of IMSL®.

Machines: VAX-VMS or any other machine with a
FORTRAN compiler.

Availability: Available on diskettes from Prof. John
Geweke, Institute of Statistics and Decision Sciences,
Duke University, Durham, NC 27706.
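
The Monte Carlo integration programs in this section (van Dijk, Geweke) rest on importance sampling. The Python sketch below is not taken from either package; it only illustrates the idea with self-normalized weights. The Beta-type posterior kernel, the normal importance density, and all function names are assumptions made for the example.

```python
import math
import random

def importance_sampling_moments(log_kernel, draw, log_density, n, rng):
    """Self-normalized importance sampling estimate of a posterior mean and variance."""
    thetas, log_w = [], []
    for _ in range(n):
        theta = draw(rng)
        lk = log_kernel(theta)
        if lk == -math.inf:
            continue  # zero posterior mass: the draw carries weight zero
        thetas.append(theta)
        log_w.append(lk - log_density(theta))
    shift = max(log_w)  # subtract the largest log weight for numerical stability
    weights = [math.exp(lw - shift) for lw in log_w]
    total = sum(weights)
    mean = sum(w * t for w, t in zip(weights, thetas)) / total
    var = sum(w * (t - mean) ** 2 for w, t in zip(weights, thetas)) / total
    return mean, var

# Assumed example: unnormalized posterior kernel theta^2 (1 - theta)^4 on (0, 1),
# i.e. a Beta(3, 5) posterior whose exact mean is 3/8.
def log_kernel(theta):
    if not 0.0 < theta < 1.0:
        return -math.inf
    return 2.0 * math.log(theta) + 4.0 * math.log(1.0 - theta)

# Assumed importance density: normal centered at the posterior mode 1/3.
MU, SIGMA = 1.0 / 3.0, 0.25

def draw(rng):
    return rng.gauss(MU, SIGMA)

def log_density(theta):
    z = (theta - MU) / SIGMA
    return -0.5 * z * z - math.log(SIGMA * math.sqrt(2.0 * math.pi))

rng = random.Random(0)
post_mean, post_var = importance_sampling_moments(
    log_kernel, draw, log_density, 20000, rng)
```

Because the weights are normalized by their sum, the normalizing constant of the posterior never has to be computed, which is what makes the method attractive for the multiple integrals described above.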


Program Name: BAYES3/3D [Multiparameter Univariate
Bayesian Analysis using Monte Carlo Integration
(Stewart)]

Function: Bayesian inference for a univariate response
variable using Monte-Carlo integration. Up to nine
parameters are allowed. It can handle the usual random
sampling data, interval data, censored data, and binomial
data at different stresses or times.

Input: Data and control cards as card-image files.

Output: Displays posterior means and percentile curves,
hazard rate functions, or probability of failure (response)
versus stress (dose) or time. (References: Stewart, L.
(1979, 83, 85).)

Language:  FORTRAN  77 

Machines:  A  graphics  terminal  is  highly  desirable  but 
not  absolutely  necessary.  Need  DISPLA  graphics 
software.  GKS  and  DI-3000  versions  are  being  written. 
Documentation:  Stewart,  L.  (1987)  User's  Manual  for 
BAYES3/3D,  A  program  for  multiparameter  univariate 
Bayesian  analysis  using  Monte  Carlo  integration. 
Availability: The program was developed under various
Federal contracts at Lockheed-Palo Alto Research
Laboratory, Palo Alto, CA 94304. Dr. Leland Stewart
will provide the tape in individual cases, on permission
from Lockheed.


Program Name: LINDLEY.BAS (Sloan)
Function: This BASIC subroutine performs algebraic
manipulation and constructs the expanded formula used
for approximating the ratio of two integrals, required in
the evaluation of the posterior distribution's features, as
discussed in Lindley (1980).

Input:  The  program  prompts  for  the  number  of 
parameters  to  be  estimated. 

Output:  The  printout  gives  the  complete  algebraic 
equation  needed  to  approximate  the  ratio  of  integrals. 
Language:  MS  BASIC 

Machine:  IBM  PC  or  compatibles.  Special  printing 
customized  for  EPSON  series  of  printers. 

Availability: Available free of charge from Prof. Jeff A.
Sloan, Department of Statistics, University of Manitoba,
Winnipeg, Manitoba, Canada R3T 2N2.

Program Name: SBAYES (Tierney)

Function:  The  system  consists  of  S®-functions  to 
compute  approximations  of  posterior  means,  variances 
and  marginal  densities  that  are  generally  more  accurate 
than  Lindley's  Method  mentioned  above  [see  for  reference: 
Tierney  and  Kadane(1986)]. 

Language:  FORTRAN  77  and  C.  Requires  access  to  the 
S®  package  for  implementation. 

Availability:  Available  free  of  charge  from  Prof.  Luke 
Tierney,  School  of  Statistics,  University  of  Minnesota, 
Minneapolis,  MN  55113. 
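
To illustrate the kind of approximation referred to in the SBAYES entry, the following Python sketch applies the Tierney and Kadane (1986) ratio of two Laplace approximations to a one-parameter example. It is not code from SBAYES. The binomial-type kernel, the function names, and the use of closed-form modes are assumptions chosen so that the exact posterior mean is available for comparison.

```python
import math

def log_kernel(theta, a, b):
    """Log of an unnormalized posterior kernel theta**a * (1 - theta)**b."""
    return a * math.log(theta) + b * math.log(1.0 - theta)

def mode_and_scale(a, b):
    """Mode of the kernel and sigma = (-h''(mode)) ** -0.5."""
    mode = a / (a + b)
    curvature = a / mode ** 2 + b / (1.0 - mode) ** 2  # equals -h''(mode)
    return mode, 1.0 / math.sqrt(curvature)

def tierney_kadane_mean(a, b):
    """Approximate E[theta] as a ratio of two Laplace approximations."""
    mode, sigma = mode_and_scale(a, b)                # denominator: h
    mode_star, sigma_star = mode_and_scale(a + 1, b)  # numerator: h + log(theta)
    log_ratio = log_kernel(mode_star, a + 1, b) - log_kernel(mode, a, b)
    return (sigma_star / sigma) * math.exp(log_ratio)

a, b = 20, 30
approx = tierney_kadane_mean(a, b)
exact = (a + 1) / (a + b + 2)  # mean of the Beta(a + 1, b + 1) posterior
```

In a realistic problem the two modes and curvatures would be found numerically; the closed forms here only keep the sketch short.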


5.  Elicitation  of  Prior  Information. 

Program  Name:  BAYES  (Schervish) 

Function:  This  program  elicits  priors  and  finds  posterior 
and  predictive  distributions  for  samples  from  normal  or 
binomial  data  with  natural  conjugate  priors  or  mixed 
conjugate  plus  point  mass  priors.  It  also  handles  flat 
priors  over  bounded  regions  for  normal  data. 

Language: FORTRAN IV, requires access to IMSL®.
Machine: DEC-2060. Graphics are good for GIGI®
terminals only.

Availability:  Available  on  request  from  Prof.  Mark 
Schervish,  Department  of  Statistics,  Carnegie  Mellon 
University,  Pittsburgh,  PA  15213. 

Program Name: [B/D] [Beliefs adjusted by Data
(Goldstein/Wooff)]

Function: This program provides an interactive,
interpretive subjectivist analysis of general (partially
specified, exchangeable) beliefs as described in
Goldstein (1987a, b, 1988).

Output: Provides summaries of how and why
beliefs are (i) expected to change and (ii) actually change,
as well as system diagnostics based on comparison of (i)
and (ii).

Language:  PASCAL 

Availability:  Available  at  cost  of  mailing  and  manual 
production  from  Prof.  Michael  Goldstein,  Department  of 
Statistics,  University  of  Hull,  Cottingham  Road,  Hull, 
U.K. 


6.  Reliability  Analysis. 

Program  Name:  BASS  [Bayesian  Analysis  for  Series 
Systems  (Martz)] 

Function:  This  program  performs  a  Bayesian  reliability 
analysis  of  series  systems  of  independent  binomial 
subsystems  and  components  for  either  prior  or  test  data  at 
the  component,  subsystem  and  overall  system  level.  It 
uses  a  beta  prior  for  the  survival  probabilities. 

Language:  FORTRAN  77 

Machines:  Portable.  Requires  DISPLA®  software 
package  for  graphics. 

Availability: Free of charge from Dr. Harry F. Martz,
Group S-1, MS F600, Los Alamos National Laboratory,
Los Alamos, NM 87545.

Program Name: BURD [Bayesian Updating of
Reliability Data (Martz)]

Function: The program performs Bayesian updating of
Binomial and Poisson likelihoods with a natural conjugate
prior or a lognormal prior for the parameter. The
updating for the lognormal prior is done via Monte Carlo
integration. These models are used in the nuclear industry.
The program is proprietary to Babcock and Wilcox Inc.
Documentation: Ahmed, S., Metcalf, D.R., Clark, R.E.
and Jacobsen, J.A. (1981) BURD - A computer program
for Bayesian updating of reliability data, NPGD-TM-582,
Babcock and Wilcox Inc., Lynchburg, VA.
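
The natural conjugate updating mentioned in the BURD entry reduces to simple parameter arithmetic. The Python sketch below is an illustration under assumed names and made-up numbers, not the BURD code: a beta prior is updated with binomial data and a gamma prior with Poisson data.

```python
def update_beta_binomial(a, b, failures, demands):
    """Beta(a, b) prior on a failure probability, binomial data."""
    return a + failures, b + demands - failures

def update_gamma_poisson(shape, rate, events, exposure):
    """Gamma(shape, rate) prior on a failure rate, Poisson data over an exposure time."""
    return shape + events, rate + exposure

# Hypothetical data: 2 failures in 100 demands, prior Beta(1, 19).
post_a, post_b = update_beta_binomial(1.0, 19.0, 2, 100)
beta_mean = post_a / (post_a + post_b)

# Hypothetical data: 3 events in 5000 hours, prior Gamma(0.5, 1000).
post_shape, post_rate = update_gamma_poisson(0.5, 1000.0, 3, 5000.0)
gamma_mean = post_shape / post_rate
```

A lognormal prior has no such closed form, which is consistent with the entry's statement that the lognormal case is handled by Monte Carlo integration.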

Program Name: IPRA [An Interactive Procedure for
Reliability Assessment, Release 2.1 (Singpurwalla)]
Function: A menu driven program that performs a prior
assessment based on expert opinion or informed
judgement and the posterior analysis for Weibull
distributed life length data in a highly interactive manner
[see Singpurwalla (1988)]. It also allows the
incorporation of the analyst's opinion on the expertise of
the experts.

Input: On-line data entry or use of menu option to store
data in a file for later use.

Output: The program computes the marginal and joint
posterior densities of the Weibull parameters. The prior
and posterior reliability functions for a specified time
interval as well as distributions of reliability for specified
mission times can be computed. These quantities can be
displayed in a tabular or 2-d/3-d graphics form or saved
on disk.

Language: IBM BASIC

Machines: IBM PC-XT or AT or compatibles with math
co-processor and IBM enhanced or color graphics adapters.
Documentation: Aboura, K. N. and Soyer, R. (1986) 'A
User's Manual for an Interactive PC-Based Procedure for
Reliability Assessment', Tech. Report GWU/IRRA/
Serial TR-86-14, George Washington University,
Washington, D.C.

Availability: The program diskette and user's manual are
available from Prof. Nozer Singpurwalla, The Institute of
Reliability & Risk Analysis, George Washington
University, Washington, D.C. 20052 for $95.

Program Name: IPND [An Interactive PC-Based System
for Predicting the Number of Defects due to Fatigue in
Railroad Tracks (Singpurwalla)]

Function: A menu driven program that performs a Bayesian
analysis of a non-homogeneous Poisson process with a
Weibull intensity function in which the assessment of the
prior information about the parameters is induced via an
engineering model based on S-N curves; see
Singpurwalla (1986). The procedure is applied to
prediction of the number of defects due to fatigue in
railroad tracks.

Input:  On-line  data  entry  or  use  of  menu  option  to  store 
data  in  a  file  for  later  use. 

Output:  The  program  computes  the  marginal  and  joint 
posterior  densities  of  the  parameters  in  the  Weibull 
intensity  function.  The  prior  and  posterior  distribution  of 
the  number  of  defects  due  to  fatigue  over  a  time  period  is 
also  computed.  These  quantities  can  be  displayed  in  a 
tabular  or  2-d/3-d  graphics  form  or  saved  on  disk. 
Language:  IBM  BASIC 

Machines: IBM PC or compatibles with math co-processor
and CGA or EGA graphics board.
Documentation: Choksy, M. and Daryanani, S. (1987)
'An Interactive PC-Based System for Predicting the
Number of Defects due to Fatigue in Railroad Tracks:
User's Manual', Tech. Report GWU/IRRA/Serial
TR-87-3, George Washington University, Washington, D.C.
Availability: Program diskette and user's manual are
available from Prof. Nozer Singpurwalla, The Institute of
Reliability & Risk Analysis, George Washington
University, Washington, D.C. 20052 at a nominal
charge.

Remarks: This procedure and the program have been
adopted by The Association of American Railroads for the
analysis of fatigue defects data in railroad tracks. It is just
one indication that the availability of appropriate software
would lead to widespread use of Bayesian methodology.

Program Name: PREDSIM [Prediction and Simulation
for mixtures of exponentials (Sloan)]

Function: This PL/I program performs a Monte-Carlo
simulation of sampling from a mixture of exponentials
model using a method proposed by Marsaglia. It
computes Bayes estimates of the systematic parameters
and reliability function and predictive intervals for future
observations.

Machine:  Portable.  Requires  access  to  IMSL®. 
Availability: Available free of charge from Prof. Jeff A.
Sloan, Department of Statistics, University of Manitoba,
Winnipeg, Manitoba, Canada R3T 2N2.


7.  Miscellaneous. 

Program  Name:  DISCBDIF  (Stroud) 

Function: This SAS® program classifies an input record
into one of two normal populations, based on
training samples from each one. It uses either Geisser's
discrimination procedure or a semi-diffuse limit of
conjugate priors.

Language:  Requires  access  to  SAS®  package  and  SAS® 
PROC  MATRIX. 

Availability: Available free of charge from Prof. Thomas
W. F. Stroud, Department of Mathematics and Statistics,
Queen's University, Kingston, Ontario K7L 3N6.

Program  Name:  BPC  [Bayesian  Probabilistic 
Classification  (Bernardo)] 

Function: This is a main program for implementing
Bayesian linear probabilistic classification, as discussed in
Bernardo (1988). It is written to run on the APPLE
Macintosh. It will be supported in the future.

Language:  MS  FORTRAN  77 

Availability:  Available  free  of  charge  from  Prof.  Jose  M. 
Bernardo,  Department  of  Statistics,  Faculty  of 
Mathematics,  4^71  Valencia,  Spain 

Program  Name:  Generalized  Hypergeometric  Function 
(Chib) 

Function: This program computes the generalized
hypergeometric function, which arises in the Bayes and
empirical Bayes estimation of the multiple correlation
coefficient with a beta prior [see Tiwari, Jammalamadaka
and Chib (1987)].

Language:  Gauss 

Machines: IBM PC and compatibles with Math 8087
Co-processor and at least 512K RAM.

Availability: Available for $5 from Prof. Siddhartha Chib,
Department of Economics, 125 Professional Building,
University of Missouri, Columbia, MO 65211.


8. Concluding Remarks.

The CADA monitor was the first and the only
general purpose program for Bayesian data analysis. It
has gone through several enhancements. Even though
CADA was demonstrated at several SBIE seminars and is
available in various machine versions, it has not been
accepted as 'the package' for Bayesian data analysis. This
is mainly because all analyses in CADA are carried out
under a noninformative or a simplistic conjugate prior
framework. It has no numerical integration capability,
which precludes analysis for realistic prior specifications.
Furthermore, the BASIC language does not provide
today's state of the art computing environment. The
graphical interface in CADA is almost non-existent.
The package was probably installed at almost all US
universities with Bayesian faculty, but has not been used
extensively for teaching courses. Thus CADA has been
used to a quite limited extent.

Among the participants of the Bayesian
Computing Workshop at OSU, there was no interest in
choosing CADA as the base for the future development of a
suitable Bayesian package. The current version of the CADA
monitor seems to be quite obsolete to us, as the basic
computing environment has not changed. On the other
hand, the package is now being marketed by a private
company. Depending on their future development
strategy, the algorithms in CADA could become a vehicle
for an acceptable system. The future plans of the CADA
group should be explored before deciding on a strategy.

The  implementation  of  the  Bayesian  paradigm  for  a 
realistic  data  analysis  requires  a  variety  of  numerical 
integration  and  approximation  routines.  The  growth  of 
the  methodology  and  software  for  this  has  been 
phenomenal.  But  there  is  a  long  way  to  go  for 
approximation  and  numerical  integration  procedures  and 
useful  graphical  displays  for  high  dimensional  problems. 

The only way to quickly develop an acceptable
Interactive Bayesian Software Package is to adopt some of
the existing main programs and subroutines as modules in
some widely used statistics package which is available for
mini and micro computers, and to add more modules to it as
new methodologies and their software are developed.
Thus one does not have to develop data management and
graphics capabilities. In addition, students and data
analysts will not have to learn yet another system. It is
also wise to develop all new Bayesian software so that it
can be incorporated in an already existing and widely
accepted computing environment.

The strategy of writing all Bayesian software in S®
compatible routines sounds appealing from the point of
view of researchers in Statistics departments, where UNIX
is slowly becoming a de facto operating system. This
was the dominant choice of the participants in the
Bayesian Computing workshop. However, S® is not
accessible to a large group of statisticians and other
researchers in Business schools and in Economics and
Engineering departments. Thus this option will limit the
accessibility of the proposed system. On the other hand,
it is about time that most of us agree on one option.

We believe that a suitable package for this purpose
is S® Version II, if it is supported. Otherwise, the most
appropriate choice is MINITAB, since it is supported and
is very widely used for teaching and data analysis. We can
expect to receive some cooperation from Minitab Inc.
with a suitable proposal, especially since there are
tremendous prospects for additional sales. We need to
settle this issue quickly if we want to see the 'Bayesian
21st Century'.

1 Introduction

This paper outlines a system called Arizona,
now under development at the U. of Washington.
Arizona is intended to be a portable,
public-domain collection of tools supporting
scientific computing, quantitative graphics,
and data analysis, implemented in Common
Lisp[31] and CLOS (the Common Lisp Object
System)[4].

Although there is substantial implementation
of some of the modules described below,
this paper is more a description of a design
than of an actual program. One excuse for
writing a paper on not-yet existing software
is that Arizona is intended primarily as a research
vehicle: it is hard to predict when, if
ever, it will mature and stabilize to the point
of robust production-quality code. However,
we hope that the ideas embodied in its design
are of interest in themselves and of use in future
scientific computing and data analysis
systems (e.g. a “New New S”[2]).

This research was supported by the Office of
Naval Research under Young Investigator award
N00014-86-K-0069 and the Dept. of Energy under contract
FG0685-ER2500. (The system has benefited
from ideas (and sometimes code) contributed by
many people, including Rick Becker, Andrew Bruce,
Andreas Buja, Pat Burns, John Chambers, Bill Dunlap,
Robert Gentleman, Peter Huber, Catherine Hurley,
John Michalak, Wayne Oldford, Jan Pedersen,
Steve Peters, Werner Stuetzle, and Alan Wilks.)

Discussion of the philosophy underlying
Arizona can be found in [22,23,18,21,24,32].
Briefly, the design is motivated by our belief
that an ideal system for scientific computing
and data analysis should have:

• One language that can be used both
for line-by-line interaction and for defining
compiled procedures.

• Minimal overhead in adding new compiled
procedures (or other definitions).

• A language that supports a wide variety
of abstractions and the definition of new
kinds of abstractions.

• Programming tools (editor, debugger,
browsers, metering and monitoring
tools).

• Automatic memory management (dynamic
space allocation and garbage collection).

• Portability over many types of workstations
and operating systems.

• A community of users and developers.

• Access to traditional Fortran scientific
subroutine libraries or equivalents.

• A representation of scientific data directly
in the data structures of the language.


• Comprehensive numerical, graphical,
and statistical functionality.

• Device independent static output graphics.

• Window based interactive graphics.

• Support for efficient and concurrent access
to large databases.

• Documentation and tutorials, both paper
and on-line.

The first nine points (through “access to Fortran”)
come for free with standard Common
Lisp environments. The remaining six are
the research aspects of Arizona.

Because of limitations of space, for the rest
of this paper we are assuming that the reader
is familiar with Common Lisp and CLOS or,
at least, Lisp and object-oriented programming
in general. Others who wish to read
this paper should review some of the references
first.

1.1 The modules

Arizona is divided into a number of modules,
with limited interdependencies, to permit
individual modules to stabilize and be
“released” before the whole system is complete.

The modules are divided into two groups:
a numerical, quantitative kernel and an interactive,
window-based, scientific graphics
part.

The non-graphical quantitative kernel is
more developed at present, because it can be
implemented in an efficient, portable way using
existing standards for Common Lisp and
CLOS. The quantitative kernel consists of:

• Basic Math, which requires Common
Lisp,

• Collections, which requires Common
Lisp and CLOS,

• Linear Algebra, which requires Basic
Math and Collections,

• Probability, which requires Linear Algebra,

• Database, which requires Collections,
and

• Statistics, which requires Database and
Probability.

The current design for the graphics part
is fairly tentative. Implementation of a
portable scientific graphics toolkit requires
a standardized interface between Common
Lisp/CLOS and the large variety of proprietary
or proposed standard window systems
for workstations and personal computers
(e.g. Symbolics Genera[36], NeWS[33],
X[29], etc.). This standard (sometimes called
Common Windows) is the subject of intense
activity in the Common Lisp community[13,
28]. I have identified three modules:

• Constraints, which requires Common
Lisp and CLOS. (This module might
very well be part of the non-graphical
kernel, but most of the applications we
have in mind at present are in graphics.)

• Quantitative Graphics, which requires
Common Windows, Collections, Constraints,
and Linear Algebra.

• Data Analysis Graphics, which requires
Quantitative Graphics and Statistics.

2 The quantitative kernel

2.1 Basic Math

Basic Math consists of things that can be
reasonably implemented with Common Lisp
functions and primitive Common Lisp data
structures; it does not use CLOS. Included in
Basic Math are: machine constants, special
functions (e.g. beta, gamma), extended vector
operations (analogous to the BLAS[15] used
in Linpack[8]), evaluation and interpolation
(e.g. generic continued fractions), 1d numerical
integration, and basic random number
generators.

2.2  Collections 

The  Collections  module  has  two  parts:  Ab¬ 
stract  Sets  and  Enumerated  Collections. 

Instances of an Abstract Set class are used to represent one of
the sets or spaces that arise in mathematical computing. Examples
are Integer-Interval, Float-Interval, and Vector-Space, which are
used in the Probability and Linear Algebra modules.

The  Enumerated  Collection  classes  are  in 
part  modeled  on  the  Collection  classes  in 
Smalltalk-80[10];  instances  are  used  for  tradi¬ 
tional  compound  data  structures,  eg.  Trees, 
Queues,  Enumerated  Sets,  Dictionaries,  In¬ 
dexes,  etc.  Enumerated  Collections  are  heav¬ 
ily  used  by  the  Database  module. 

An Enumerated Collection basically serves as a framework for
iterating over its elements. A simple collection might be
represented by a list; more complex collections permit more
efficient specialized access. (Eg. a time series might use a
doubly linked list to give efficient access to lagged
observations; discrete data might use an n-dimensional array for
quick access to the cells of a contingency table.)

2.3  Linear Algebra

The Linear Algebra module is discussed in detail in [21], where
it is referred to as Cactus. It provides approximately the same
functionality as Linpack[8] and Eispack[30]. However, CLOS allows
Cactus to operate at a level of abstraction chosen to match the
initial, high-level, geometric descriptions of algorithms given
in standard numerical analysis texts[11]. The use of
object-oriented programming makes the implementation of standard
algorithms (eg. a QR decomposition) easier to understand and
modify than the versions in the best Fortran libraries, without
sacrificing efficiency in either space or time. In addition, it
is much easier to use information about regular structure,
patterns of sparsity, etc., to get improved performance in
special problems. Also, the higher level of abstraction permits
extensions to, for example, computations on Hilbert spaces[14].

The Linear Algebra module provides: class definitions for
Vector-Spaces, class definitions for Vector-Transformations
(Matrix, Positive-Definite-Matrix, Householder, Product, etc.),
methods for the protocol corresponding to the algebra of linear
transformations (transform, compose, scale, add), methods for
“matrix” decompositions (LU, QR, LQ, SVD, eigen, etc.), and the
ability to solve systems of linear equations and least squares
problems using a generic pseudo-inverse function that can be
applied to any linear transformation.
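
The protocol idea can be illustrated with a small sketch. It is written in Python rather than Common Lisp/CLOS and is not Cactus code: the class names LinearMap, Matrix, Scale and Product, and the choice to keep a composition symbolic, are assumptions made only for this illustration. It shows how different representations of a linear transformation can answer the same transform and compose messages.

```python
class LinearMap:
    """Base class: anything that maps a vector (list of numbers) to a vector."""

    def transform(self, v):
        raise NotImplementedError

    def compose(self, other):
        # "self after other", kept as a symbolic product instead of
        # multiplying the two representations out.
        return Product(self, other)


class Matrix(LinearMap):
    """Dense representation: a list of rows."""

    def __init__(self, rows):
        self.rows = [list(r) for r in rows]

    def transform(self, v):
        return [sum(a * x for a, x in zip(row, v)) for row in self.rows]


class Scale(LinearMap):
    """A multiple of the identity, stored as a single number."""

    def __init__(self, factor):
        self.factor = factor

    def transform(self, v):
        return [self.factor * x for x in v]


class Product(LinearMap):
    """Composition of two maps; applies the right one first."""

    def __init__(self, left, right):
        self.left, self.right = left, right

    def transform(self, v):
        return self.left.transform(self.right.transform(v))


m = Matrix([[1, 2], [3, 4]])
doubled = Scale(2).compose(m)        # a Product; no matrix arithmetic is done
result = doubled.transform([1, 1])   # m gives [3, 7], Scale then doubles it
```

A caller only needs transform; whether the object is a dense matrix, a scaled identity or a product is hidden behind the shared protocol, which is the property the text attributes to the object-oriented design.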

2.4  Probability 

Inference and Monte Carlo simulation (including Bootstrapping)
are supported in a unified framework through a protocol for
Probability-Measure classes. Probability measure objects are
responsible for generating samples from themselves, computing
their quantiles, and computing the probabilities of appropriate
sets, including tail probabilities. The defined probability
measure classes include the standard one- and higher-dimensional
parametric densities and discrete distributions, and
non-parametric measures, either resulting from density estimates
or the empirical measure of a data set. (It’s worth noting that
simple descriptive statistics like mean, median, etc., are
generic functions in the probability measure protocol and are
applied to data sets by viewing them as empirical distributions.)
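
A minimal Python sketch of what such a protocol could look like for the empirical measure of a data set is given here. The class name EmpiricalMeasure and the method names sample, probability, quantile, mean and median are assumptions chosen to mirror the responsibilities listed in the text; they are not Arizona's actual Lisp interface.

```python
import math
import random
import statistics


class EmpiricalMeasure:
    """Puts mass 1/n on each of the n observed values."""

    def __init__(self, values):
        self.values = sorted(values)

    def sample(self, n, rng=random):
        # Drawing with replacement from the data is a bootstrap resample.
        return [rng.choice(self.values) for _ in range(n)]

    def probability(self, lo, hi):
        # Mass of the closed interval [lo, hi].
        inside = sum(1 for v in self.values if lo <= v <= hi)
        return inside / len(self.values)

    def quantile(self, p):
        # Smallest observed value whose cumulative mass reaches p.
        k = max(0, math.ceil(p * len(self.values)) - 1)
        return self.values[k]

    def mean(self):
        return statistics.mean(self.values)

    def median(self):
        return statistics.median(self.values)


data = EmpiricalMeasure([3, 1, 4, 1, 5])
resample = data.sample(10, rng=random.Random(0))
```

Because mean and median are methods of the measure, the same calls would apply unchanged to a parametric distribution class that implements the same protocol.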

2.5  Database 

The Database module has two parts. The first concerns the
representation of statistical data by collections of objects and
is fairly well developed. The second is concerned with providing
true database facilities: efficient concurrent access to large
(gigabyte) collections of objects whose identities persist beyond
the lifetime of a particular Lisp address space. The second part
is a major research topic in the database and object-oriented
programming communities[25,26].

2.5.1  Collections  of  objects 

In most statistical packages, data sets are represented as
two-dimensional arrays of floating point numbers. Each row
represents an individual and each column represents a variable.
This is an awkward representation, for example, for categorical
data, and for data sets with more complicated structure, such as
clustering trees. It is impossible to represent simple, but
important, contextual information, such as the fact that a
negative value for height must be an error or that height at age
2 should be greater than height at age 1. An array representation
makes it difficult to sort and select subsets without losing
track of important correspondences, such as the fact that row 17
in the array of subsurface coal producers represents the same
company as row 25 in the array of all coal producers and average
sulfur content is column 3 in subsurface coal producers and
column 5 in all coal producers.

In  Arizona,  statistical  data  is  represented 
by  collections  of  objects.  The  advantages 
of  this  are  discussed  in  detail  in  [18].  Indi¬ 
viduals  are  represented  by  objects,  instances 
of  CLOS  classes.  Variables  are  represented 
by  generic  functions.  A  dataset  is  repre¬ 
sented  by  a  collection,  typically  a  list  or  one¬ 
dimensional  array. 

For  example,  in  analyzing  energy  con¬ 
sumption  data  for  cities  in  the  US,  the  data 
on  each  city  would  be  collected  into  an  in¬ 
stance  of  the  City  class.  A  particular  in¬ 
stance  might  look  like: 

{City Seattle :population 450000
:cooling-degree-days 300 ...}.

Statistical variables are represented by generic functions. To
get at the values in the slots we use automatically defined
accessor functions: (population {City Seattle}). The use of
generic accessor functions gives a unified way to refer to slots
or arbitrary functions of slots; we can ask for
(log-population {City Seattle}), where log-population is the
obvious Lisp function.

This might seem inefficient, compared to conventional systems,
where defining a new variable means adding a column to an array,
because it looks as if we would have to call a procedure every
time we wanted a value of the log-population variable. However,
standard Lisp programming techniques (lazy evaluation and
memoization [1]) make it possible to represent variables by
functions, hide the additional complexity from the user, and
ensure that the log-population procedure is not called any more
often than is absolutely necessary.
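
The memoization idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the decorator memoized_variable and the call log are hypothetical, the cache is keyed on object identity, and nothing here handles invalidation when a slot such as population changes, which a real system would need.

```python
import math


class City:
    def __init__(self, name, population):
        self.name = name
        self.population = population


def memoized_variable(fn):
    """Wrap a one-argument variable so each object is evaluated only once."""
    cache = {}

    def variable(obj):
        key = id(obj)                     # identity of the individual
        if key not in cache:
            cache[key] = (obj, fn(obj))   # keep obj alive so the id stays unique
        return cache[key][1]

    return variable


calls = []   # records every real evaluation, for demonstration only


@memoized_variable
def log_population(city):
    calls.append(city.name)
    return math.log(city.population)


seattle = City("Seattle", 450000)
first = log_population(seattle)    # computed
second = log_population(seattle)   # served from the cache
```

The user still writes an ordinary function call; the saved value plays the role that an extra column plays in an array-based system.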

Each  object  has  an  identity  and  existence 
independent  of  any  collection.  So  the  same 
object  can  be  in  many  collections;  the  unique 
object  {City  Seattle}  would  be  a  member 
of  both  All-Cities  and  Northwest-Cities. 
Similarly,  generic  functions  are  defined  in¬ 
dependently  of  any  collection  and  can  be 
applied  to  any  object  (for  which  there  is  a 
method).  The  independent  identities  main¬ 
tain  the  important  correspondences  that  can 
be  hard  to  keep  track  of  in  an  array  based 
system. 

Also, a collection may contain objects of more than one type. For
example, in energy production data, it might prove useful to
analyze coal and oil producers together, but to define separate
coal and oil producer classes, to allow for the fact that
acres-strip-mined is not a relevant slot for oil producers. In
that case, all-energy-producers would contain instances of at
least two different classes.
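
The two points above (one object in several collections, and several classes in one collection) can be sketched in Python. The producer names and numbers are invented for the illustration, and functools.singledispatch is used here as a rough stand-in for a CLOS generic function with a method on only one class.

```python
from functools import singledispatch


class CoalProducer:
    def __init__(self, name, output, acres_strip_mined):
        self.name = name
        self.output = output
        self.acres_strip_mined = acres_strip_mined


class OilProducer:
    def __init__(self, name, output):
        self.name = name
        self.output = output


@singledispatch
def acres_strip_mined(producer):
    # Default case: there is no method for this class.
    raise TypeError(
        f"acres_strip_mined is not defined for {type(producer).__name__}")


@acres_strip_mined.register(CoalProducer)
def _(producer):
    return producer.acres_strip_mined


coal = CoalProducer("Hypothetical Coal Co.", output=10, acres_strip_mined=120)
oil = OilProducer("Hypothetical Oil Co.", output=25)

all_energy_producers = [coal, oil]   # two classes in one collection
coal_producers = [coal]              # the same object, not a copy

total_output = sum(p.output for p in all_energy_producers)
same_object = coal_producers[0] is all_energy_producers[0]
```

Because both collections hold the same object, no row-number bookkeeping is needed to know that the two entries refer to the same company.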

2.5.2  Persistent  Objects 

A  true  database  requires  objects  that  per¬ 
sist  beyond  the  lifetime  of  the  address  space 
in  which  they  were  created.  Arizona  will  be 
used  for  research  into  a  hierarchy  of  function¬ 
ality  relating  to  persistent  objects: 

1.  Making a copy of the current state of an object, in the same
address space. (There are some non-intuitive difficulties in this
seemingly trivial task; see [27].)

2.  Saving objects to disk.

3.  Automatic checkpointing.

4.  Objects that can undo certain changes.


5.  Objects  that  can  recover  some  number 
of  previous  states. 

6.  Objects  that  can  recover  any  previous 
state. 

7.  Object  identities  that  persist  beyond  a 
particular  address  space  (rebooting). 

8.  Objects  that  can  recover  a  valid  state 
after  catastrophic  hardware  or  software 
failure[35]. 

9.  Sharing  objects  by  more  than  one 
user/address  space. 

10.  Efficient, concurrent access to large, persistent, shared
databases.

2.6  Statistics 

The Statistics module represents the usual descriptive statistics
by generic functions that are thought of as functionals on
measures. (All the usual descriptive statistics can be thought of
as functionals on measures if we consider a dataset to be a
measure with total mass N.)

Simple statistical functionals take a collection and one or more
variables (Lisp functions) as arguments. For example:
(median All-Cities #'log-population), where All-Cities is a
Collection of City objects and also an Empirical-Measure.

Median returns a number; more complex statistical functionals
return instances of a Description class. A Description object
remembers its training sample and can update itself in response
to changes in the training sample. Of particular interest are
Model objects, which are Descriptions that are also functions.

For example, least squares linear regression takes as arguments a
collection intended as the training sample, a generic function
representing the response, and a list of generic functions
representing the predictors. The result of the regression is a
Regression-Model object. The
Regression-Model  object  fits  itself  to  the 
training  sample  by  1)  extracting  a  linear 
transformation  by  applying  the  predictor 
functions  to  each  object  in  the  training  sam¬ 
ple,  2)  extracting  a  response  vector  by  ap¬ 
plying  the  response  function,  3)  computing 
a  generalized  inverse  of  the  transformation 
via  QR  or  SVD,  and  4)  applying  the  gener¬ 
alized  inverse  to  the  response  vector.  The  re¬ 
gression  model  is  also  a  function  in  the  sense 
that  it  can  be  applied  to  any  appropriate  ob¬ 
ject  (whether  or  not  in  the  training  sample) 
to  predict  a  value  for  the  response.  In  addi¬ 
tion,  the  regression  object  is  able  to  compute 
and  report  appropriate  diagnostics  and  up¬ 
date  its  fit  in  reaction  to  inserting  or  deleting 
objects  in  the  training  sample  or  functions  in 
the  predictor  list. 
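
The behaviour described above can be sketched in Python as follows. This is an assumption-laden illustration, not Arizona's implementation: the class RegressionModel and its insert method are hypothetical names, the data are invented, and for brevity the fit solves the normal equations by Gauss-Jordan elimination rather than forming a generalized inverse via QR or SVD as the text describes.

```python
class RegressionModel:
    """Least squares fit that remembers its sample and acts as a function."""

    def __init__(self, sample, response, predictors):
        self.sample = list(sample)
        self.response = response
        self.predictors = list(predictors)
        self.fit()

    def fit(self):
        # Steps 1 and 2: apply the variable functions to every object.
        X = [[f(obj) for f in self.predictors] for obj in self.sample]
        y = [self.response(obj) for obj in self.sample]
        # Steps 3 and 4, simplified: solve X'X b = X'y directly.
        p = len(self.predictors)
        A = [[sum(r[i] * r[j] for r in X) for j in range(p)]
             + [sum(r[i] * yi for r, yi in zip(X, y))]
             for i in range(p)]
        for c in range(p):
            pivot = max(range(c, p), key=lambda r: abs(A[r][c]))
            A[c], A[pivot] = A[pivot], A[c]
            for r in range(p):
                if r != c:
                    m = A[r][c] / A[c][c]
                    A[r] = [a - m * b for a, b in zip(A[r], A[c])]
        self.coefficients = [A[i][p] / A[i][i] for i in range(p)]

    def __call__(self, obj):
        # The model is also a function: predict the response for any object.
        return sum(b * f(obj)
                   for b, f in zip(self.coefficients, self.predictors))

    def insert(self, obj):
        # Changing the training sample triggers a refit.
        self.sample.append(obj)
        self.fit()


sample = [{"x": 0.0, "y": 1.0}, {"x": 1.0, "y": 3.0}, {"x": 2.0, "y": 5.0}]
model = RegressionModel(sample,
                        response=lambda o: o["y"],
                        predictors=[lambda o: 1.0, lambda o: o["x"]])
prediction = model({"x": 10.0})          # object outside the training sample
model.insert({"x": 3.0, "y": 7.0})       # model refits itself
```

The normal-equations shortcut is numerically weaker than QR or SVD for ill-conditioned predictors; it is used here only to keep the sketch self-contained.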

3  Scientific  Graphics 

The kernel described in the previous section is useless as a data
analysis system, because it lacks any graphics. An important
reason for the popularity of systems like S is their convenience
and flexibility in showing pictures of data.

Our primary goal is to make it easy to improvise new kinds of
plots without losing the performance needed for interactive and
motion graphics. The Quantitative Graphics module supports this
goal in two major ways: defining a protocol for the
representation of plots by hierarchical display objects and
implementing mechanisms for maintaining constraints between the
components of a display object (layout constraints) and between a
window and the object(s) being shown in the window (viewing
constraints).

3.1  Hierarchical  Display  Objects 

We  represent  a  plot  as  a  tree  of  Display-Node 
objects.  Every  Display-Node  has: 

•  a  parent  Display-Node.  The  root  of  the 
display  has  no  parent. 

•  a  list  of  children  Display-Nodes,  which 
is  empty  for  terminal  nodes. 

•  a  local  coordinate  system,  chosen  to  be 
convenient  for  describing  the  appear¬ 
ance  or  position  of  the  node.  For  exam¬ 
ple,  the  local  coordinate  system  might 
consist  of  xyz  position  coordinates,  rgb 
color  coordinates,  a  size  coordinate,  a 
theta  orientation  coordinate,  and  so  on. 
The  coordinate  system  is  represented 
by  an  instance  of  an  abstract  set  class, 
something  like  the  vector  spaces  used  in 
the  Linear  Algebra  module. 

•  appearance  and  position  parameters  that 
allow  the  node  to  be  treated  as  an  ele¬ 
ment  of  the  local  coordinate  system. 

•  a local viewing transformation, which takes local coordinates
to the local coordinates of the parent. For the root node, it
takes local coordinates to screen coordinates, that is, pixels
and pixel-values representing color. The relationship between the
local viewing transformation and the coordinate systems is like
the relationship of the linear transformations and vector spaces
in the Linear Algebra module.

•  a list of layout constraints which make assertions about
relations between the sizes, shapes, viewing transformations, or
local coordinate systems of descendants of the current node.

For efficiency in motion graphics, Display-Nodes may add:

•  a  total  viewing  transformation,  which 
is  obtained  by  composing  all  the  local 
transformations  between  the  node  and 
the  root  of  the  tree. 

•  a  factoring  of  the  total  viewing  transfor¬ 
mation  into  time-varying  and  constant 
parts. 

•  a  cache  holding  the  result  of  applying 
the  constant  factor. 

For  example,  many  implementations  of  ro¬ 
tating  scatterplots  implicitly  factor  the  view¬ 
ing  transformation  into  constant  translation 
and  scaling  and  a  time-varying  rotation.  If 
the  scaling  is  chosen  carefully,  the  rotation 
can  be  computed  in  integer  arithmetic  and 
produce  exact  screen  coordinates,  increas¬ 
ing  the  speed  of  rotation  by  as  much  as  100 
times — at  the  cost  of  non-modular,  machine 
specific  drawing  routines.  However,  we  can 
implement  the  basic  idea  in  a  modular  way, 
by  providing  methods  for  factoring  viewing 
transformations  analogous  to  the  matrix  de¬ 
compositions  provided  by  the  Linear  Algebra 
module.  The  same  paradigm  has  been  used 
in  higher  dimensional  graphics[12]  and  is  also 
applicable  when  color  or  shape  is  changing 
over  time  (rather  than  just  position). 
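
A small Python sketch of the total viewing transformation and its cache follows. It is deliberately reduced to one-dimensional affine maps (x goes to scale * x + offset) and does not attempt the factoring into constant and time-varying parts; the class DisplayNode and its method names are assumptions made for this illustration only.

```python
class DisplayNode:
    """Node in a display tree with a 1-d affine local viewing transformation."""

    def __init__(self, scale=1.0, offset=0.0, parent=None):
        self.local = (scale, offset)   # local coordinates -> parent's coordinates
        self.parent = parent
        self.children = []
        self._total = None             # cached composition up to the root
        if parent is not None:
            parent.children.append(self)

    def total_transform(self):
        # Compose local transformations from this node up to the root,
        # reusing the parent's cached result when it exists.
        if self._total is None:
            a, b = self.local
            if self.parent is not None:
                pa, pb = self.parent.total_transform()
                a, b = pa * a, pa * b + pb
            self._total = (a, b)
        return self._total

    def set_local(self, scale, offset):
        self.local = (scale, offset)
        self._invalidate()

    def _invalidate(self):
        # A change here makes the cached totals of all descendants stale.
        self._total = None
        for child in self.children:
            child._invalidate()

    def to_screen(self, x):
        a, b = self.total_transform()
        return a * x + b


root = DisplayNode(scale=100.0, offset=50.0)        # root maps to pixels
axes = DisplayNode(scale=0.5, offset=0.1, parent=root)
point = DisplayNode(scale=1.0, offset=0.2, parent=axes)

before = point.to_screen(0.0)
root.set_local(200.0, 0.0)                          # e.g. the window is resized
after = point.to_screen(0.0)
```

Drawing a node needs one multiply and one add once the total is cached; only a change to some ancestor forces the composition to be redone.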

For  efficient  handling  of  input  (decid¬ 
ing  which  node  the  mouse  is  pointing  at) 
Display-Nodes  may  pre-compute  and  cache 
screen  coordinates — sometimes  of  a  single 
pixel,  but  more  frequently  of  one  or  more 
rectangular  regions. 

Some Display-Nodes are Presentations, which means that they serve
as a visible representation of some other object in the
programming environment: the subject of the presentation. (This
discussion is very loosely related to the concept of presentation
given in [9] and used in the Symbolics Genera system[36] and to
the Model-View-Controller user interface architecture used in
Smalltalk[7].) For example, a point in a scatterplot is a
presentation of a record in a data set.

A presentation is related to its subject by a viewing constraint,
discussed in the next section.

3.2  Constraints 

Constraints are abstractions that arise naturally in many
statistical, scientific, or graphics problems[1,17,16]. A
constraint language allows the programmer to make assertions
whose truth is automatically maintained in the course of
subsequent computation. Spreadsheets are a widely used, if
limited, form of constraint language. A full-fledged constraint
language is a major research undertaking in itself[6,34,16]. We
intend to implement at least two less ambitious constraint
systems:

3.2.1  The  Viewing  Constraint 

The  basic  idea  is  that  a  window  is  a  view 
of  one  or  more  objects  and  should  always 
show  the  current  state  of  those  objects.  We 
have  a  fairly  good  understanding  of  how 
to  implement  this  type  of  constraint.  The 
basic  technique  is  similar  to  Active  Values 
in  LOOPS[5].  The  system  automatically 
triggers  appropriate  computation  whenever 
some  presentation’s  subject  is  modified.  The 
triggered  computation  may  take  place  imme¬ 
diately  or  may  be  put  off  until  a  valid  state 
of  the  presentation  is  needed  (eg.  until  the 
window  is  exposed). 
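
The trigger-then-defer behaviour can be sketched in Python. The classes Subject and Presentation, the slot names and the redraw counter are assumptions for this illustration; the sketch only shows the idea of marking a presentation invalid on modification and recomputing when a valid state is next requested.

```python
class Subject:
    """An object whose slot changes notify its presentations."""

    def __init__(self, **slots):
        self._slots = dict(slots)
        self._presentations = []

    def get(self, name):
        return self._slots[name]

    def set(self, name, value):
        self._slots[name] = value
        for presentation in self._presentations:
            presentation.invalidate()      # the automatic trigger


class Presentation:
    """A point whose coordinates are taken from two slots of its subject."""

    def __init__(self, subject, x_slot, y_slot):
        self.subject = subject
        self.x_slot, self.y_slot = x_slot, y_slot
        self._valid = False
        self._coords = None
        self.redraws = 0
        subject._presentations.append(self)

    def invalidate(self):
        self._valid = False                # cheap; no computation yet

    def coords(self):
        # Recompute only when a valid state is actually needed,
        # for example when the window is exposed.
        if not self._valid:
            self._coords = (self.subject.get(self.x_slot),
                            self.subject.get(self.y_slot))
            self.redraws += 1
            self._valid = True
        return self._coords


seattle = Subject(population=450000, altitude=50)
dot = Presentation(seattle, "population", "altitude")
initial = dot.coords()
seattle.set("population", 460000)
seattle.set("altitude", 52)
redraws_before_expose = dot.redraws        # two changes, no recomputation yet
updated = dot.coords()
```

Deferring the work means that several modifications in a row cost a single recomputation.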

The viewing constraint between a presentation and its subject
determines (1) how the state of the subject is reflected in the
presentation and (2) how input received by the node affects the
subject. For example, if the subject is a city object in an
energy consumption database and the presentation is a point in a
scatterplot, the viewing constraint is responsible for supplying
the presentation with, for example, population as the x
coordinate, altitude as the y coordinate and average particulate
ppm as a color variable. When the user selects the point with the
mouse, the viewing constraint is responsible for performing the
appropriate action on the subject, such as producing an editor
window that lets the user inspect and possibly alter the slots
and values for that particular city.

In  a  simple  case,  the  presentation  and  sub¬ 
ject  share  a  display  style  object.  The  display 
style  has  parameters  like  color,  size,  orienta¬ 
tion,  etc.  The  presentation  takes  its  appear¬ 
ance  directly  from  the  display  style.  When¬ 
ever  the  subject  changes  its  display  style,  the 
presentation  is  automatically  notified  to  re¬ 
draw  itself. 

Support  for  the  viewing  constraint  makes 
it  easy  to  implement  and  generalize  brushing 
scatterplots[20,19,3,32].  Earlier  versions  of 
brushing  were  based  on  a  special  plot  that 
contained  several  scatterplots,  each  showing 
different  variables.  The  basic  design  could 
not  be  easily  extended  for  use  in  a  window 
system  where  arbitrary  scatterplots  might  be 
visible  at  any  time,  or  to  other  kinds  of  plots 
besides  simple  scatterplots. 

In Arizona, brushing is implemented in the following way: as the
cursor (or brush) moves over a point in a scatterplot, the
presentation is “painted” with the display style that was loaded
on the brush. The constraint system causes the display style of
the subject (a record in the database) to be updated
automatically, which in turn causes the display styles of all
other presentations of that subject to be updated. A consequence
of this design is that all exposed plots are automatically
involved in painting. No plot needs to know what other plots are
on the screen.
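
The brushing mechanism just described can be sketched in Python. The classes DisplayStyle, Record and Point, and the second record name, are hypothetical; the point of the sketch is only that painting goes through the subject, so every presentation of that subject follows without the plots knowing about each other.

```python
class DisplayStyle:
    def __init__(self, color):
        self.color = color


class Record:
    """The subject: one record in the data set."""

    def __init__(self, name):
        self.name = name
        self.style = DisplayStyle("black")
        self.presentations = []

    def set_style(self, style):
        self.style = style
        for point in self.presentations:   # every plot showing this record
            point.redraw()


class Point:
    """A presentation of a record in one particular plot."""

    def __init__(self, record):
        self.record = record
        self.drawn_color = record.style.color
        record.presentations.append(self)

    def redraw(self):
        self.drawn_color = self.record.style.color

    def paint(self, brush_style):
        # The brush changes the subject, not the point itself.
        self.record.set_style(brush_style)


records = [Record("Seattle"), Record("Boise")]
plot_a = [Point(r) for r in records]   # one scatterplot
plot_b = [Point(r) for r in records]   # another plot of the same records

plot_a[0].paint(DisplayStyle("red"))   # the brush passes over Seattle in plot A
```

After the call, Seattle is red in both plots and Boise is unchanged, although plot_a holds no reference to plot_b.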

Extensions  to  other  types  of  plots  are  rea¬ 
sonably  straightforward. 

3.2.2  Layout  constraints 

Plot  layout  is  a  more  open-ended  and  diffi¬ 
cult  constraint  problem.  The  idea  is  to  pro¬ 
vide  the  data  analyst  with  a  language  for 
making  and  enforcing  assertions  about  the 
relative  sizes,  shapes,  or  positions  of  the  com¬ 
ponents  of  a  plot. 

A typical example (conceptually trivial but difficult to program)
is centering labels around the sides of a scatterplot. The source
of programming difficulty is conflicting coordinate systems. The
center of the data region is naturally expressed in data
coordinates. Heights and widths of label strings can usually only
be determined in pixels, for a given font. The mapping of the
data region into pixels cannot be determined until we know how
much room is left by the labels, but we can’t position the
labels, choose a font, and determine the label widths and heights
until we know where the data region is in device coordinates.

What  we  will  need  to  support  layout  con¬ 
straints  is: 

•  a  specification  language. 

•  internal  representation. 

•  general  purpose  satisfier. 

•  hooks  for  user  supplied  satisfier  code. 

•  fast specialized satisfiers that respond to common
perturbations from a solution.

•  effective ways of identifying and reporting under/over
constrained problems.

ABSTRACT

MACSYMA is a symbolic manipulation program which can solve many
algebraic problems. The feature of MACSYMA that is especially
useful in optimal design problems is its ability to manipulate
matrices with symbolic entries. This paper will illustrate, by an
example, how it can be used in the particular problem of
designing a logistic regression experiment.

1. INTRODUCTION 

This paper will not attempt to review all that MACSYMA can do.
The purpose of this paper is to give some indication of its
usefulness by showing its use explicitly in a particular problem.
The illustration uses simple commands and shows how a naive user
of MACSYMA can utilize some of its very basic capabilities. The
algebra in the recent papers Chaloner (1987a, b) and Chaloner and
Larntz (1988) was obtained using MACSYMA and it is some of these
applications which will be described.

MACSYMA is documented comprehensively in the MACSYMA reference
manual. More effective introductions are given in a book by Rand
(1984) and an excellent collection of worked examples by Drinkard
and Sulinski (1981). Statistical applications are described by
Gong (1983), Steele (1985, 1988), Rand (1988) and Chaloner
(1988). In Chaloner (1988) the use of MACSYMA is illustrated in
the problem of design for estimating the point at which a
quadratic regression is a maximum or a minimum. The optimal
Bayesian design is derived and examined, using MACSYMA. In this
paper MACSYMA will be used in another, similar, problem of design
for logistic regression.

MACSYMA  can  do  many  things.  The  aspects  of 
MACSYMA  described  in  this  paper  are  some  of  its 
simplest  capabilities. 

2.  OPTIMAL DESIGN FOR LOGISTIC REGRESSION

A logistic regression model corresponds to a binomial sampling
distribution for data $y$. Specifically, for $n_i$ observations
taken at a value $x_i$ of an explanatory variable, the response
$y_i$ is binomial with $n_i$ trials and probability of success
$p(x_i,\theta)$, where $\theta = (\beta_0,\beta_1)^T$ and the
probability $p(x,\theta)$ is related to $x$ by

$$p(x_i,\theta) = [1 + \exp(-\beta_0 - \beta_1 x_i)]^{-1}.$$

We think of a design as a probability measure on a compact design
space $X$ which puts a proportion $\eta(x_i)$ of the observations
at $x_i$. If there are a total of $n$ observations with $n_i$
observations at $x_i$, with $\sum n_i = n$, then the proportion
$\eta(x_i)$ is $n_i/n$.

The Fisher information matrix is the matrix of minus the expected
value of the second derivative of the log likelihood, with the
expectation taken over the sampling distribution of the data. For
a design $\eta$ we denote this information matrix as
$nI(\theta,\eta)$, so that $I(\theta,\eta)$ is a normalized
information matrix.

We can think of the design problem as choosing a measure $\eta$
which optimizes some function of the information matrix. In
Chaloner and Larntz (1988) designs are found which maximize the
expectation, over a prior distribution on $\theta$, of a function
of $I(\theta,\eta)$. In particular the following two criteria are
maximized:

$$\phi_1(\eta) = E \log\det I(\theta,\eta) \qquad (1)$$

and

$$\phi_2(\eta) = -E\,\mathrm{tr}\,A(\theta)\,I(\theta,\eta)^{-1}. \qquad (2)$$

The criterion (1) is to maximize the expected value of the log of
the determinant of the information matrix and the criterion (2)
is to minimize (by maximizing its negative) the expected value of
the weighted trace of the inverse of the information matrix,
where the weights may depend on $\theta$. These criteria can be
justified as approximate Bayesian criteria. For the criterion of
maximizing (2) the choice of $A(\theta)$ depends on what is to be
estimated or predicted. Several choices of $A(\theta)$ will be
discussed in Section 4.

Chaloner and Larntz (1988) show how to find optimal, or close to
optimal, designs. The criterion must be evaluated using numerical
integration and optimized using numerical optimization.

The numerical optimization appears to be dealt with best by
fixing the number of design points, $k$, and finding the best
design for that number of design points. The values of $x_i$ and
$\eta_i$ are found numerically. A search over several values of
the number of design points, $k$, can then be done. As $k$ is
increased the maximized criterion should become larger until, if
there is an optimal design on a finite number of design points,
it stays constant.

If a design is found, by numerically optimizing the criterion and
searching over a number of design points, it is possible to
verify that the design found corresponds to a global optimum of
the criterion over all possible design measures $\eta$. A
necessary and sufficient condition for a design $\eta$ to be
optimal is that the Fréchet directional derivative of the
criterion function, in the direction of all one point designs, is
non-positive. These derivatives will be defined in Section 4
where it is demonstrated how MACSYMA can be used to find the
criteria and their derivatives.

3.  THE  INFORMATION  MATRIX 

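For a design measure $\eta$ on $k$ points, $x_1,\ldots,x_k$,
define the function $w(x,\theta)$ as
$p(x,\theta)[1-p(x,\theta)]$. Further define the following for
$i = 1,\ldots,k$:

$$w_i = w(x_i,\theta)$$

$$\eta_i = \eta(x_i)$$

$$t = \sum_{i=1}^{k} \eta_i w_i$$

$$\bar{x} = t^{-1}\sum_{i=1}^{k} \eta_i w_i x_i$$

$$s = \sum_{i=1}^{k} \eta_i w_i (x_i - \bar{x})^2.$$

Note that $w_i$, $t$, $\bar{x}$ and $s$ all depend on $\theta$
but this dependence has been dropped to simplify the notation. We
further define $\mu$ to be the ratio $-\beta_0/\beta_1$ and
reparameterize the problem in terms of $\mu$ and
$\beta = \beta_1$. The parameter $\mu$ is the value of $x$ at
which $p(x,\theta) = 1/2$. Redefine $\theta^T = (\mu,\beta)$;
then with this notation and parametrization the matrix
$I(\theta,\eta)$ is:

$$I(\theta,\eta) = \begin{pmatrix} \beta^2 t & -\beta t(\bar{x}-\mu) \\ -\beta t(\bar{x}-\mu) & s + t(\bar{x}-\mu)^2 \end{pmatrix}.$$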
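
As a numerical sanity check of this closed form, the following Python sketch compares it with a direct sum over the design points. It assumes the standard logistic-regression form of the per-point information, $w\,g g^T$ with $g = (-\beta,\; x-\mu)^T$ the gradient of $\beta(x-\mu)$ with respect to $(\mu,\beta)$, and it takes xbar to be the weighted mean (the weighted sum divided by $t$). The design points, weights and parameter values are arbitrary illustrative numbers, and the function names are hypothetical.

```python
import math


def success_probability(x, mu, beta):
    # Logistic model in the (mu, beta) parametrization.
    return 1.0 / (1.0 + math.exp(-beta * (x - mu)))


def weight(x, mu, beta):
    p = success_probability(x, mu, beta)
    return p * (1.0 - p)


def information_direct(design, mu, beta):
    """Sum of eta_i * w_i * g g^T over the design points."""
    info = [[0.0, 0.0], [0.0, 0.0]]
    for x, eta in design:
        g = (-beta, x - mu)
        w = weight(x, mu, beta)
        for r in range(2):
            for c in range(2):
                info[r][c] += eta * w * g[r] * g[c]
    return info


def information_closed_form(design, mu, beta):
    """The same matrix written in terms of t, xbar and s."""
    terms = [(eta * weight(x, mu, beta), x) for x, eta in design]
    t = sum(ew for ew, _ in terms)
    xbar = sum(ew * x for ew, x in terms) / t
    s = sum(ew * (x - xbar) ** 2 for ew, x in terms)
    d = xbar - mu
    return [[beta ** 2 * t, -beta * t * d],
            [-beta * t * d, s + t * d * d]]


# (x_i, eta_i) pairs; the eta_i sum to one.
design = [(-1.0, 0.3), (0.5, 0.3), (2.0, 0.4)]
direct = information_direct(design, mu=0.2, beta=1.5)
closed = information_closed_form(design, mu=0.2, beta=1.5)
max_gap = max(abs(direct[r][c] - closed[r][c])
              for r in range(2) for c in range(2))
```

The two matrices agree to rounding error for this design, which is consistent with the entries typed into MACSYMA in Figure 1.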

We will use MACSYMA to show that the inverse of this matrix,
$I(\theta,\eta)^{-1}$, can be expressed as:

$$I(\theta,\eta)^{-1} = \begin{pmatrix} \{s + t(\bar{x}-\mu)^2\}/(st\beta^2) & (\bar{x}-\mu)/(\beta s) \\ (\bar{x}-\mu)/(\beta s) & 1/s \end{pmatrix}.$$

Figure 1 is a record of a MACSYMA session to show that this is
indeed $I(\theta,\eta)^{-1}$. In the UNIX system that I use to
run MACSYMA it is run by typing "macsyma" and this is shown in
the first line of Figure 1 at the % prompt. Instructions typed in
are labelled as (c1), (c2), ... and end with a semi-colon;
corresponding output is labelled as (d1), (d2), .... A name
followed by a colon at the beginning of a command assigns the
name to the resulting expression. For example, the matrix
$I(\theta,\eta)$ is denoted as i in (c1) and its inverse as iinv
in (c2). As MACSYMA does not recognize Greek letters, the symbols
b and m are used to denote $\beta$ and $\mu$ respectively.

In the (c3) command the matrices i and iinv are multiplied
together to verify that they are inverses (the symbol . denotes
matrix multiplication and * denotes scalar multiplication).
Without the expand command the resulting matrix would not be so
easily identified as the identity matrix. The save command was
used in (c4) to save the expressions i and iinv in file1.

4.  DIRECTIONAL DERIVATIVES

The directional derivatives, as derived in Chaloner and Larntz
(1988), are as follows. The derivatives for $\phi_1(\eta)$ and
$\phi_2(\eta)$, in the direction of the design which is point
mass at $x$, are denoted by $d_1(\eta,x)$ and $d_2(\eta,x)$
respectively. Recall that $w(x,\theta)$ is
$p(x,\theta)[1-p(x,\theta)]$ and define $v^T$ as
$(-\beta,\; x-\mu)$, then:

$$d_1(\eta,x) = E\, w(x,\theta)\, v^T I(\theta,\eta)^{-1} v - 2 \qquad (4)$$

and

$$d_2(\eta,x) = E\, w(x,\theta)\, v^T I(\theta,\eta)^{-1} A(\theta)\, I(\theta,\eta)^{-1} v + \phi_2(\eta). \qquad (5)$$

For a design $\eta_0$ to be $\phi_1$-optimal the function
$d_1(\eta_0,x)$ must be non-positive for all $x$ in $X$, and for
$\eta_0$ to be $\phi_2$-optimal for a particular choice of
$A(\theta)$ the function $d_2(\eta_0,x)$ must be non-positive for
all $x$.

4.1  THE DERIVATIVE FOR $\phi_1$

We demonstrate using MACSYMA to find the criterion $\phi_1(\eta)$
and the derivative $d_1(\eta,x)$. A record of using MACSYMA to do
this is given as Figure 2. The matrices i and iinv are read in
using the loadfile command, reading in from the file created in
Figure 1. It is seen in expression (d2) that the criterion, the
expected value of the log determinant of the information matrix,
can be expressed as:

$$\phi_1(\eta) = E \log(\beta^2 t s).$$

Only the part of the derivative that is multiplied by
$w(x,\theta)$ and then integrated numerically is calculated as
expression (d4), that is $v^T I(\theta,\eta)^{-1} v$. The expand
command simplifies the resulting expression and, recognizing the
expansion of $(\bar{x}-x)^2$, the derivative is:

$$d_1(\eta,x) = E\,[\,w(x,\theta)\,\{1/t + (\bar{x}-x)^2/s\}\,] - 2.$$

The matrices i and iinv and the vector v are saved in file2 for
use later.

% macsyma

This is UNIX MACSYMA Release 309.1.
(c) 1976,1984 Massachusetts Institute of Technology.
All Rights Reserved.
Enhancements (c) 1984 Symbolics, Inc. All Rights Reserved.
Type describe(trade_secret); to see Trade Secret notice.
Type exec("man macsyma"); for help.

(c1) i:entermatrix(2,2);

Is the matrix  1. Diagonal  2. Symmetric  3. Antisymmetric  4. General
Answer 1, 2, 3 or 4
2;
Row 1 Column 1   b^2*t;
Row 1 Column 2   -b*t*(xbar-m);
Row 2 Column 2   s+t*(xbar-m)^2;
Matrix entered

        [       2                                ]
        [      b  t            - b t (xbar - m)  ]
(d1)    [                                        ]
        [                                  2     ]
        [ - b t (xbar - m)     t (xbar - m)  + s ]

(c2) iinv:factor(invert(i));

Batching the file /usr/macsyma.309/share/invert.mac
Batching done.

        [       2                 2                   ]
        [ t xbar  - 2 m t xbar + m  t + s    xbar - m ]
        [ -------------------------------    -------- ]
        [              2                       b s    ]
(d2)    [             b  s t                          ]
        [                                             ]
        [            xbar - m                    1    ]
        [            --------                    -    ]
        [               b s                      s    ]

(c3) expand(i.iinv);

        [ 1  0 ]
(d3)    [      ]
        [ 0  1 ]

(c4) save([file1],i,iinv);

(d4)    [file1, i, iinv]

(c5) quit();


FIGURE 1


(c1) loadfile(file1);
file1 being loaded.

(d1)      done

(c2) expand(determinant(i));

           2
(d2)      b  s t

(c3) v:matrix([-b],[x-m]);

          [  - b  ]
(d3)      [       ]
          [ x - m ]

(c4) expand(transpose(v).iinv.v);

              2                2
          xbar    2 x xbar    x    1
(d4)      ----- - -------- + -- + -
            s        s       s    t

(c5) save([file2],i,iinv,v);

(d5)      [file2, i, iinv, v]

(c6) quit();

FIGURE 2
4.2 THE DERIVATIVE FOR $\phi_2$

The weighted trace criterion of $\phi_2$-optimality
corresponds, approximately, to squared error loss of
estimation. The criterion therefore requires that the
quantities to be estimated, or predicted, are
carefully specified by the experimenter.

If, for example, the only parameter of interest is
$\mu$, then $A(\theta)$ can be written as $cc^T$ with $c = (1,0)^T$.
If both $\mu$ and $\beta$ are of equal interest then $A(\theta)$ is the
identity matrix. If the only quantities of interest are
linear combinations of $\mu$ and $\beta$ then the matrix $A(\theta)$
will not depend on the unknown parameters.

Alternatively, if a nonlinear function of $\mu$ and $\beta$ is
of interest then the matrix $A(\theta)$ will depend on the
unknown parameters. For example it is often of
interest to estimate the value of x at which the
probability of success, $p(x,\theta)$, is a particular value.
Suppose that we want to estimate $x_0$ where
$\mathrm{logit}\{p(x_0,\theta)\} = \gamma$; then $x_0 = \mu + \gamma\beta^{-1}$, which is a
nonlinear function. Standard asymptotic arguments
give $A(\theta) = c(\theta)c(\theta)^T$, where $c(\theta)$ is a vector of
derivatives, $c(\theta) = (1, -\gamma\beta^{-2})^T$.

A distribution could be put on $\gamma$ to represent
interest in estimating several percentile response
points $x_0$. This is a standard way of using the
weighted trace criterion in a Bayesian framework.

Figure 3 shows how functions can be created in
MACSYMA that calculate $\phi_2(\eta)$ and $d_2(\eta,x)$ for any
choice of $A(\theta)$. These functions are called criterion
and deriv respectively and their use is illustrated, in
Figure 3, for finding expressions for $\phi_2(\eta)$ and $d_2(\eta,x)$
when $\mu$ is the only parameter of interest. These
functions are created in (c2) and (c3). The matrix
$A(\theta)$ for estimating $\mu$ alone is entered and used to
find $\phi_2(\eta)$ and $d_2(\eta,x)$ in (d5) and (d7) respectively.
The dispfun command is used to display all user
defined functions in (c8) and then in (c9) the
functions are saved in a file called file3.
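The two MACSYMA functions of Figure 3 translate directly into a few lines of ordinary code. The sketch below is a Python analogue, offered as an illustration rather than as the author's implementation: the helper names, the explicit passing of iinv and v as arguments, and the numerical values are assumptions. It evaluates the trace and the quadratic form only; the expectation over the prior and the weight $w(x,\theta)$ are outside its scope.

```python
from fractions import Fraction as F

def matmul(a, b):
    """Product of two 2x2 matrices."""
    return [[sum(a[r][k] * b[k][c] for k in range(2)) for c in range(2)]
            for r in range(2)]

def make_iinv(b, m, s, t, xbar):
    """Inverse information matrix, as displayed in (d2) of Figure 1."""
    return [[(t * (xbar - m) ** 2 + s) / (b * b * s * t), (xbar - m) / (b * s)],
            [(xbar - m) / (b * s), 1 / s]]

def criterion(a, iinv):
    """Trace of a.iinv, mirroring (c2) of Figure 3."""
    p = matmul(a, iinv)
    return p[0][0] + p[1][1]

def deriv(a, iinv, v):
    """The scalar v' iinv a iinv v, mirroring (c3) of Figure 3."""
    mid = matmul(iinv, matmul(a, iinv))
    return sum(v[r] * mid[r][c] * v[c] for r in range(2) for c in range(2))

# Arbitrary illustrative values (assumptions, not taken from the paper).
b, m, s, t, xbar, x = F(3, 2), F(1, 4), F(5), F(2), F(7, 3), F(-1, 2)
iinv = make_iinv(b, m, s, t, xbar)
v = [-b, x - m]
a = [[F(1), F(0)], [F(0), F(0)]]   # interest in the first parameter only

crit_value = criterion(a, iinv)
deriv_value = deriv(a, iinv, v)
# Closed forms corresponding to (d5) and (d7) of Figure 3.
crit_closed = (t * (xbar - m) ** 2 + s) / (b ** 2 * s * t)
deriv_closed = ((t * (xbar - x) * (xbar - m) + s) / (b * s * t)) ** 2
print(crit_value == crit_closed, deriv_value == deriv_closed)
```

Passing iinv and v explicitly, rather than relying on global names as the MACSYMA functions do, is a design choice made here only to keep the sketch self-contained.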

4.3 EXAMPLES OF $\phi_2$-OPTIMALITY

The example in Figure 3 is the criterion where we
suppose that interest is in estimation of $\mu$ alone.
Then, as discussed earlier, $A(\theta)$ is $cc^T$ with
$c = (1,0)^T$. As shown in the MACSYMA output we have:

$$\phi_2(\eta) = -E\left[\beta^{-2}\{t^{-1} + (\bar{x}-\mu)^2/s\}\right]$$

and

$$d_2(\eta,x) = E\left[w(x,\theta)(\beta s t)^{-2}\{t(\bar{x}-x)(\bar{x}-\mu) + s\}^2\right] + \phi_2(\eta).$$

Suppose alternatively that we want to estimate
$x_0 = \mu + \gamma\beta^{-1}$ for a known value of $\gamma$. For example in
engineering and reliability experiments it is
sometimes of interest to estimate an extreme
percentile response point, such as the point at which
$p(x,\theta) = 0.95$. In this case the functions defined in
MACSYMA could be used, with an appropriate choice of
$A(\theta)$, to show that the criterion and derivative can be
expressed as:

$$\phi_2(\eta) = -E\left[\beta^{-2}\{t^{-1} + (\gamma - \beta(\bar{x}-\mu))^2\beta^{-2}s^{-1}\}\right]$$

and

$$d_2(\eta,x) = E\left[w(x,\theta)(\beta^2 s t)^{-2}\{t(\bar{x}-x)(\beta(\bar{x}-\mu)-\gamma) + \beta s\}^2\right] + \phi_2(\eta).$$

Finally suppose that we are interested in
estimating several percentile response points. We can
put a distribution over $\gamma$ to represent this interest, as
some points can be of more interest than others.
Then the matrix $A(\theta)$ becomes:

$$A(\theta) = \begin{pmatrix} 1 & -E(\gamma)/\beta^2 \\ -E(\gamma)/\beta^2 & E(\gamma^2)/\beta^4 \end{pmatrix}.$$

For illustration, suppose we put a uniform distribution
over [-1,1] on $\gamma$. This represents an interest in
calibrating the central part of the response curve.
Then $E(\gamma) = 0$ and $E(\gamma^2) = 1/3$. The use of the two
MACSYMA functions easily leads to the following
expressions:

$$\phi_2(\eta) = -E\left[\beta^{-2}\{t^{-1} + (\bar{x}-\mu)^2/s + (3\beta^2 s)^{-1}\}\right]$$

and

$$d_2(\eta,x) = E\left[w(x,\theta)(\beta^2 s t)^{-2}\{\beta^2(t(\bar{x}-x)(\bar{x}-\mu) + s)^2 + t^2(\bar{x}-x)^2/3\}\right] + \phi_2(\eta).$$
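The percentile-point case, with $A(\theta) = c(\theta)c(\theta)^T$ and $c(\theta) = (1, -\gamma\beta^{-2})^T$, can be checked numerically in the same spirit. The following Python sketch is illustrative only: the function names and the numerical values are assumptions, and the closed forms in the code are the trace $\beta^{-2}\{t^{-1} + (\gamma - \beta(\bar{x}-\mu))^2\beta^{-2}s^{-1}\}$ and the quadratic form $(\beta^2 s t)^{-2}\{t(\bar{x}-x)(\beta(\bar{x}-\mu)-\gamma) + \beta s\}^2$, written before taking expectations.

```python
from fractions import Fraction as F

def iinv_matrix(b, m, s, t, xbar):
    """Inverse information matrix, as displayed in (d2) of Figure 1."""
    return [[(t * (xbar - m) ** 2 + s) / (b * b * s * t), (xbar - m) / (b * s)],
            [(xbar - m) / (b * s), 1 / s]]

def trace_and_quad(c, iinv, v):
    """For A = c c', return (trace(A iinv), v' iinv A iinv v).

    Because iinv is symmetric these reduce to c' iinv c and (c' iinv v)^2.
    """
    ic = [iinv[0][0] * c[0] + iinv[0][1] * c[1],
          iinv[1][0] * c[0] + iinv[1][1] * c[1]]
    return c[0] * ic[0] + c[1] * ic[1], (v[0] * ic[0] + v[1] * ic[1]) ** 2

# Arbitrary illustrative values (assumptions); g plays the role of gamma.
b, m, s, t, xbar, x, g = F(3, 2), F(1, 4), F(5), F(2), F(7, 3), F(-1, 2), F(3)
iinv = iinv_matrix(b, m, s, t, xbar)
v = [-b, x - m]
c = [F(1), -g / b ** 2]            # derivatives of x0 = m + g/b

tr, quad = trace_and_quad(c, iinv, v)
tr_closed = (1 / t + (g - b * (xbar - m)) ** 2 / (b ** 2 * s)) / b ** 2
quad_closed = ((t * (xbar - x) * (b * (xbar - m) - g) + b * s)
               / (b ** 2 * s * t)) ** 2
print(tr == tr_closed, quad == quad_closed)
```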
(c1) loadfile(file2);
file2 being loaded.

(d1)      done

(c2) criterion(a) := (a.iinv)[1,1] + (a.iinv)[2,2];

(d2)      criterion(a) := (a . iinv)     + (a . iinv)
                                    1, 1             2, 2

(c3) deriv(a) := transpose(v).(iinv.a.iinv).v;

(d3)      deriv(a) := transpose(v) . ((iinv . (a . iinv)) . v)

(c4) a:matrix([1,0],[0,0]);

          [ 1  0 ]
(d4)      [      ]
          [ 0  0 ]

(c5) criterion(a);

                2                   2
          t xbar  - 2 m t xbar + m  t + s
(d5)      -------------------------------
                        2
                       b  s t

(c6) deriv(a);

                                     2
                   (x - m) (xbar - m)
(d6)      (x - m) (-------------------
                          2  2
                         b  s

                           2                   2
         (xbar - m) (t xbar  - 2 m t xbar + m  t + s)
       - --------------------------------------------)
                           2  2
                          b  s  t

                                       2                   2
             (x - m) (xbar - m) (t xbar  - 2 m t xbar + m  t + s)
       - b (----------------------------------------------------
                                 3  2
                                b  s  t

                 2                   2      2
         (t xbar  - 2 m t xbar + m  t + s)
       - ---------------------------------)
                     3  2  2
                    b  s  t

(c7) factor(expand(d6));

                 2                                     2
          (t xbar  - t x xbar - m t xbar + m t x + s)
(d7)      -------------------------------------------
                           2  2  2
                          b  s  t

(c8) dispfun(all);

(e8)      criterion(a) := (a . iinv)     + (a . iinv)
                                    1, 1             2, 2

(e9)      deriv(a) := transpose(v) . ((iinv . (a . iinv)) . v)

(d9)      done

(c10) save([file3],i,iinv,v,criterion,deriv);

(d10)     [file3, i, iinv, v, criterion, deriv]

(c11) quit();

FIGURE 3
5. DISCUSSION

I have also used MACSYMA in studying optimal designs
for other problems. Chaloner (1988) describes using
MACSYMA to examine the problem of designing an
experiment to estimate the point, in a quadratic
regression, at which the response is maximized or
minimized. This problem is studied in, for example,
Buonaccorsi and Iyer (1986) and relevant results are
given in Murty and Studden (1972). Both Bayesian and
locally optimal designs are found and described in
Chaloner (1987a) and the use of MACSYMA for the
proof of these results is described in Chaloner
(1988). Aspects of MACSYMA that are used in this
problem include: finding a generalized inverse of a
singular matrix, finding the roots of a quartic
polynomial, taking derivatives of functions and
plotting functions with symbolic arguments. Other
features of MACSYMA that I have used in design and
other problems are: writing a FORTRAN expression for
inclusion in a program, taking a Taylor series
expansion, finding integrals and taking limits.

MACSYMA is clearly a useful tool for these kinds
of algebraic manipulations. Although I have not
solved any problems that I could not otherwise have
solved by careful, time consuming hand calculations, I
have found MACSYMA extremely useful, fast and
accurate. I believe that the initial effort in learning
how to use MACSYMA to its fullest capabilities is
well worth it. It is also fun to use.
AN INTRODUCTION TO CART™: CLASSIFICATION AND REGRESSION TREES

Gerard  T.  LaVarnway,  Norwich  University 


1. Introduction

The use of binary trees to perform
classification provides an interesting
alternative to classical parametric
methods. CART™ is a fascinating
mathematical theory that was developed
by Leo Breiman (UC Berkeley), Jerome H.
Friedman (Stanford), Richard A. Olshen
(UC San Diego) and Charles J. Stone (UC
Berkeley/UCLA), culminating in the
monograph CART: Classification and
Regression Trees (Breiman et al., 1984).

In addition, CART™ was developed into a
powerful software package that applies a
nonparametric approach to classification
and regression problems. Specifically,
the software arrives at prediction rules
in the form of binary decision trees.

This paper will discuss CART™
methodology as a tool in
analyzing and solving classification
problems. CART™ also performs
regression. However, this paper will
focus solely on the classification
problem. The procedure which performs
regression is similar, but slightly
different.

An appendix is provided, which
contains the complete CART™ processing
and output on data from a classification
problem. The entire CART™ output is
provided for completeness.

2. Statement of the Problem

The  general  classification  problem  may 
be  described  as  follows:  Given  a 
multivariate  observation  z  which  is 
known  to  belong  to  (emanate  from) 
one  of  n  possible  populations 
(platforms),  determine  which  population 
is  most  likely.  The  analyst  who  is 
performing  this  classification  has  an 
historic  data  base  of  observations, 
for  each  of  which  the  actual  population 
is  known,  and  has  suspicions  -  in 
the  form  of  prior  probabilities 
regarding  the  likely  population  of  z. 

For clarity, let us define our
measurement vector x to be an
N-dimensional vector $x = (x_1, x_2, x_3, \ldots, x_N)$.
CART™ allows for the variables
$x_i$ to be of continuous and/or
categorical type. A continuous variable
is a variable that takes on real
numbered values. A categorical variable
is a variable that assumes a value from
a discrete set (e.g. {red, blue,
green}). The vector x is known to
belong to one of J classes, j = 1, 2, ..., J.

In performing classification, an
analyst records the observation vector,
x, of an object and predicts the class,
j, to which the object belongs.

Sample classification problems are as
follows:

o At the University of California
at San Diego (UCSD) Medical Center,
incoming heart attack patients are
monitored on 17 different variables
(blood pressure, age, etc.). The medical
staff would like to predict whether the
patient is at high or low risk of
death (Breiman et al., 1984).

o Determine a ship's class
(destroyer, cruiser, submarine,
battleship, aircraft carrier, etc.) from
surveillance observations.

o Predict a college freshman's
success or failure in his/her first
mathematics course from various previous
test measurements (e.g. SAT scores,
etc.).

3. CART™ METHODOLOGY

This section provides a brief summary
of CART™ processing. For a complete
description of CART™ and its supporting
theory, the reader should consult the
monograph, Breiman et al. (1984).

CART™ arrives at its classification
rule(s) by producing a binary decision
tree which partitions a set into
disjoint subsets. This partition has
the property that, for any element of a
given subset, a class can be assigned.

A sample classification tree might
look as follows:
Sample Classification Tree

Figure 2-1

To construct a binary decision tree,
a learning sample is required. A
learning sample is a set of measurement
observations for which the true class
of each measurement observation is
known. It is desirable to include in the
measurement vector all variables which
are believed to have some predictive
power in determining the classification
of the measurements. CART™ uses the
learning sample to construct a decision
tree that can then be used to classify
an observation whose class is unknown.

In the sample decision tree (Figure
2-1), we observe CART™'s partitioning
of the space into descendant nodes
(subsets). The square boxes indicate
terminal nodes. A terminal node is a
node at which a class assignment can be
made. The circular nodes indicate
nonterminal nodes, where a class
determination cannot be made. At each
nonterminal node, a binary (yes/no)
question is asked, "splitting" that node
into two descendant subnodes, which
may or may not be terminal nodes.

The concept of splitting is
fundamental to CART™ processing.

CART™ allows for three different types
of splits:

1) univariate splits on a
continuous variable: is $x_n \le C$, C a
fixed real number.

2) linear combination splits on
continuous variables: is
$c_1 x_1 + c_2 x_2 + \ldots + c_n x_n \le C$,
C a fixed real number.

3) splits on a categorical
variable: if $x_n$ is an element of a
finite set s, then ask the question, "Is
$x_n \in S$", where S ranges over all
possible subsets of s.

Another natural question is "How does
CART™ decide on a particular split for a
given node?" The choice of a split is
made on the notion of impurity. CART™
chooses the split that minimizes the
impurity. "What is meant by impurity?"
Definition 3.1: Call $\phi = \phi(p_1, p_2, \ldots, p_k)$,
a function of non-negative
arguments with $\sum_{j=1}^{k} p_j = 1$, an impurity
function if

1) $\phi \ge 0$

2) $\phi$ is maximum when $p_1 = p_2 = \ldots = p_k$

3) $\phi = 0$ when $p_j = 1$ for some j

where k is equal to the number of
classes.

Example 3.1: The entropy measure of
impurity is given by

$$\phi = -\sum_{j=1}^{k} p_j \log p_j$$

with $0 \log 0 = 0$.

Example 3.2: The Gini measure of
impurity is given by

$$\phi = 1 - \sum_{j=1}^{k} p_j^2$$
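The two impurity measures are easy to state in code. The following Python sketch is an illustration only; the function names are assumptions, and the natural logarithm is used for the entropy (any base gives an impurity function).

```python
import math

def entropy(p):
    """Entropy impurity: -sum p_j log p_j, with 0 log 0 taken as 0."""
    return -sum(q * math.log(q) for q in p if q > 0)

def gini(p):
    """Gini impurity: 1 - sum p_j^2."""
    return 1.0 - sum(q * q for q in p)

pure = [1.0, 0.0]      # all cases in one class: impurity is zero
even = [0.5, 0.5]      # equal class proportions: impurity is maximal
skew = [0.9, 0.1]

for name, p in (("pure", pure), ("even", even), ("skew", skew)):
    print(name, round(entropy(p), 4), round(gini(p), 4))
```

Both functions satisfy the three conditions of an impurity function: they are non-negative, largest at equal proportions, and zero for a pure node.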

Once  an  impurity  function  has  been 
defined  and  selected,  we  ultimately 
define  the  impurity  of  a  node  and  the 
impurity  of  a  tree. 

Definition 3.2: The impurity $i_\phi(t)$ of a
node t is

$$i_\phi(t) = \phi(p(1|t), p(2|t), \ldots, p(k|t))$$

where p(j|t) is the estimated
probability of a class j object at node t.

Definition 3.3: The impurity $I_\phi(T)$ of a
tree T is

$$I_\phi(T) = \sum_{t \in \tilde{T}} i_\phi(t)\,p(t)$$

where $\tilde{T}$ denotes the set of terminal
nodes of T and p(t) is an estimate of
the probability that a case falls into
node t.

With  the  above  definitions,  CART 
selects  the  split  that  minimizes  the 
overall  tree  impurity. 
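As an illustration of split selection, the sketch below searches the univariate questions "is x <= C?" on one continuous variable and keeps the one with the smallest weighted impurity of the two descendant nodes. It is a simplified Python sketch, not the CART software: the Gini measure, the use of midpoints between adjacent observed values as candidate thresholds, and the estimate of p(t) by the fraction of the learning sample in the node are all assumptions made for the example.

```python
from collections import Counter

def gini_of(labels):
    """Gini impurity of a node from the class labels of the cases in it."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(xs, ys):
    """Return (C, weighted impurity) for the best question 'x <= C'.

    Returns (None, impurity of the unsplit node) if no split improves on it.
    """
    n = len(xs)
    values = sorted(set(xs))
    best = (None, gini_of(ys))
    for lo, hi in zip(values, values[1:]):
        c = (lo + hi) / 2                      # candidate threshold
        left = [y for x, y in zip(xs, ys) if x <= c]
        right = [y for x, y in zip(xs, ys) if x > c]
        w = len(left) / n * gini_of(left) + len(right) / n * gini_of(right)
        if w < best[1]:
            best = (c, w)
    return best

# Hypothetical learning sample: one continuous variable, two classes.
xs = [1, 2, 3, 4, 5, 6]
ys = ["a", "a", "a", "b", "b", "b"]
threshold, impurity = best_split(xs, ys)
print(threshold, impurity)
```

Since the impurity of the rest of the tree is unchanged by splitting one node, minimizing the weighted impurity of the two descendants is the same as minimizing the overall tree impurity.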

We now have established the procedure
for splitting a nonterminal node into
descendant subnodes. However, the issue
of how CART™ selects the optimal
decision tree for classification has not
been addressed.

CART™ continues splitting
(partitioning) until an overly large
tree, $T_{max}$, is grown. That is, a tree
with all terminal nodes pure or having a
count less than or equal to some small
number (default 5). A process known as
"pruning" generates a nested sequence of
subtrees. A subtree is created by
pruning off a branch or branches from
the previous subtree. Selection of the
branch to be pruned is done by a cost
complexity measure.

The cost complexity measure combines
the resubstitution estimate
of misclassification cost with a "penalty" for
the complexity of the tree.

Definition 3.4: For a given tree T let
$M_\alpha(T)$,

$$M_\alpha(T) = R(T) + \alpha|T|,$$

be the cost complexity of T with
complexity parameter $\alpha$, $\alpha \ge 0$. R(T) is
the resubstitution estimate for
misclassification cost of tree T. |T| is
the number of terminal nodes in tree T.

We  see  from  the  above  definition  that 
by  increasing  the  value  of  a,  we 
increase  the  penalty  for  the  complexity 
of  the  tree. 
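The effect of the complexity parameter can be seen in a small numerical example. In the Python sketch below the three candidate subtrees, their resubstitution costs R(T) and their numbers of terminal nodes |T| are hypothetical values chosen for illustration; the function names are likewise assumptions.

```python
def cost_complexity(r, n_terminal, alpha):
    """Cost complexity R(T) + alpha * |T| of a tree."""
    return r + alpha * n_terminal

def best_subtree(subtrees, alpha):
    """Name of the subtree with the smallest cost complexity.

    subtrees is a list of (name, R(T), |T|) triples.
    """
    return min(subtrees, key=lambda s: cost_complexity(s[1], s[2], alpha))[0]

# Hypothetical nested subtrees: larger trees have lower resubstitution cost.
subtrees = [("small", 0.30, 1), ("medium", 0.20, 3), ("large", 0.10, 8)]

choice_0 = best_subtree(subtrees, alpha=0.0)     # no penalty
choice_1 = best_subtree(subtrees, alpha=0.03)    # moderate penalty
choice_2 = best_subtree(subtrees, alpha=0.10)    # heavy penalty
print(choice_0, choice_1, choice_2)
```

With no penalty the largest tree wins; as the penalty grows the minimizer moves to progressively smaller subtrees.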

By the pruning technique, CART™
generates a sequence of nested subtrees
$T_1, T_2, \ldots, T_{max}$. Associated with each
tree in this sequence is an estimate of
the misclassification cost for that
tree. Three methods for estimating this
misclassification cost are available:
resubstitution, test sample, and cross
validation (see Breiman et al., 1984, pp.
72-81).

Naturally, one would think that
CART™ selects the subtree with the
minimum misclassification cost.
However, there is some uncertainty
associated with the misclassification
estimates. CART™ resolves this
uncertainty by calculating their
standard errors (SE) (see Breiman et
al., 1984, pp. 78-81).

CART™ then selects the subtree with
the least number of terminal nodes whose
estimated cost is within one (1) standard
error of the minimum. The
decision to select the subtree with the
minimum number of terminal nodes is due
to the fact that a simpler tree is
preferred. CART™ allows the user, as
an option during execution, to vary the
SE rule. For example, if the user
desires the tree with the absolute
minimum misclassification, set the SE
rule to 0.0 SE. Any variation of this SE
rule is allowed: 2 SE, 1.5 SE, etc.
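The SE rule can be sketched in a few lines. The Python example below is an illustration under stated assumptions: the candidate subtrees and their cost estimates are hypothetical, the function name is invented for the example, and the standard error used to set the limit is taken to be that of the minimum-cost subtree.

```python
def select_by_se_rule(candidates, k=1.0):
    """Pick the smallest tree whose cost is within k SE of the minimum.

    candidates is a list of (terminal nodes, cost estimate, standard error).
    """
    best = min(candidates, key=lambda c: c[1])      # minimum-cost subtree
    limit = best[1] + k * best[2]
    eligible = [c for c in candidates if c[1] <= limit]
    return min(eligible, key=lambda c: c[0])        # fewest terminal nodes

# Hypothetical pruning sequence with estimated costs and standard errors.
candidates = [(2, 0.30, 0.03), (5, 0.22, 0.03), (9, 0.20, 0.03), (15, 0.21, 0.03)]

one_se = select_by_se_rule(candidates, k=1.0)
zero_se = select_by_se_rule(candidates, k=0.0)
print(one_se[0], zero_se[0])
```

Here the 1 SE rule settles on the 5-node tree, while the 0.0 SE rule returns the 9-node tree with the absolute minimum estimated cost.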

Once an optimal subtree has been
selected, objects whose class is unknown
may be passed down the decision tree for
classification.

The final issue of importance is "how
is the class assignment performed?"
This is done in the most natural way.
If the misclassification costs are
equal, the assignment at each terminal
node is the most populous class of the
learning set in that node. If the
misclassification costs are not equal,
CART™ assigns class j* to node t where j*
minimizes

$$\sum_m C(j|m)\,p(m|t)$$

where C(j|m) is the cost for classifying
a class m object as a class j object and
p(m|t) is the probability of a class m
object at node t.
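The class assignment rule amounts to a small expected-cost calculation. The Python sketch below is illustrative: the function name, the node probabilities and the cost matrices are assumptions, and classes are indexed from 0 rather than from 1.

```python
def assign_class(p, cost):
    """Return the class j minimizing sum_m C(j|m) p(m|t).

    p[m] is p(m|t); cost[j][m] is C(j|m), the cost of classifying
    a class m object as class j.
    """
    k = len(p)
    expected = [sum(cost[j][m] * p[m] for m in range(k)) for j in range(k)]
    return min(range(k), key=expected.__getitem__)

p = [0.7, 0.3]                      # hypothetical class probabilities at node t

equal_costs = [[0, 1], [1, 0]]      # every misclassification costs 1
unequal_costs = [[0, 5], [1, 0]]    # calling a class 1 object class 0 costs 5

j_equal = assign_class(p, equal_costs)
j_unequal = assign_class(p, unequal_costs)
print(j_equal, j_unequal)
```

With equal costs the rule reduces to picking the most populous class; with the unequal costs above, the less populous class is assigned because misclassifying it is expensive.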

To summarize, CART™ constructs a
binary decision tree in the following
manner:

1) Produce an overly large tree
using binary questions, minimizing tree
impurity at each step.

2) Prune this large tree,
generating a nested sequence of
subtrees, each with an associated
misclassification cost.

3) Select the optimal tree for use
as a classifier.

Any discussion of CART™ would not be
complete without mentioning some of
CART™'s nonstandard features that make
the software so attractive. A list of
the nonstandard features that I find
useful is as follows:

1) CART™ is a nonparametric
approach. It is nonparametric in
the sense that it places no restrictions
on the distribution(s) of any
variable(s) (e.g. normality is not
assumed).

2) More classical statistical
methods cannot deal with missing data in
a natural way. CART™ handles missing
data by the use of "surrogate splits".
When a split is selected, CART™
measures the association of splits on
other variables to the chosen split. In
the event data is missing for a split in
the tree, CART™ would then split on the
variable with the greatest association.
This associated split is called a
surrogate split.

3) Linear combination splits are
allowed. If the structure for a given
problem depended on a combination of
variables, univariate splits would prove
unsatisfactory. As mentioned earlier,
CART™ allows for linear combination
splits.

4) Variable importance: CART™
provides as part of its output a ranking
of the variables. This ranking may prove
useful in identifying variables with the
most predictive power.

SUMMARY  AND  CONCLUSION 

The reader has been introduced to
the classification problem and the use
of binary tree classifiers.
Specifically, CART™ has proven to be a
procedure rich in mathematical theory,
as well as a powerful software package
that performs classification and
regression.

The many nonstandard features that
CART™ supports make it appealing. In
addition to being a nonparametric
approach, it also provides an
interesting alternative to more
classical statistical methods.

CART™'s decision rules are easy to
use, understand and interpret. It
provides interesting analysis of
problems from various disciplines
including the social sciences,
medicine, physical science,
surveillance, etc.
Generating Code for Partial Derivatives:
Some Principles and Applications to Statistics


John W. Sawyer, Jr.
Texas Tech


Abstract

The  author  re-examines  his  previous 
results  on  generating  code  for  first 
partial  derivatives  of  a  function  using 
the  natural  action  of  a  compiler:  Source 
code  for  the  function  alone  yields 
object  code  for  its  first  partials,  and 
this  object  code  executes  in  a  time  at 
most  proportional  to  the  execution  time 
for  the  original  function.  (The 
proportionality  constant  does  not  depend 
on  the  function  or  the  number  of  its 
arguments.)  Implications  of  these 
results  for  the  generation  of  code  for 
higher  order  partials  will  be  discussed, 
as  will  applications  to  some  statistical 
methodologies.

1. Introduction

In  Sawyer  (1984)  the  author  presented 
a  strategy  for  computation  of  first 
partial  derivatives  of  a  function.  This 
strategy  can  be  integrated  smoothly  into 
the natural action of a compiler. The
development  of  this  strategy  was 
motivated  by  a  desire  to  streamline  the 
manner  in  which  a  function  had  to  be 

for  Grizzle,  Starmer,  and  Koch 
(1969)  analysis  of  categorical  data, 
though  the  applications  of  the  strategy 
are much wider. A brief discussion of
applications  of  strategies  for 
efficient,  transparent  computation  of 
first  and  higher  order  partials  will  be 
found  at  the  end  of  this  paper,  though 
statisticians  should  require  little 
convincing  of  the  value  of  a  convenient 
way  to  get  efficiently  computed 
derivatives.

Subsequent discussions with
colleagues have led the author to
conclude that there is some confusion
about what is different in the Sawyer
(1984) paper from other attacks on
automatic differentiation such as the
monograph by Rall (1981). The answer
is that, to the best of the
author's knowledge, it had not been
pointed out before that (i) without
changing its scanner or parser, a
compiler which is capable of producing
object code to evaluate a
user-programmed function can be modified
to produce object code for first
partials (no symbol manipulation of
source code is involved), and (ii) this
modification will produce object code
which computes all partials in a time
proportional to the time which it takes
simply to evaluate the user function.

It is not the intent of this paper to
redevelop  these  ideas  in  detail,  as  they 
are  discussed  at  length  in  Sawyer 


(1984).  A  brief  review  of  the  basis  for 
(i)  and  (ii)  above  is  useful,  however, 
in  that  it  suggests  how  relatively 
efficient  automatic  generation  of  higher 
order  partials  might  proceed.  The  next 
section  provides  such  a  review,  while 
Section  3  discusses  second  partial 
generation. 

2.  First  Partial  Generation:  A  Review 

Consider  the  arbitrary  function 
programmed  in  a  high  level  language  such 
as  FORTRAN: 

F=((X2**2)*COS(X1))/(X1+2.*EXP(X1/X2))        (2.1)

A  compiler  will  first  scan  this  code  to 
identify  variables,  constants, 
operators,  and  relations,  translating 
each  into  appropriate  numeric  codes. 

This numeric translation will then be
parsed, resulting typically in object
code which is represented here in
high-level form for reader convenience:

A1=X2;  B1=A1**2;  A2=X1;  B2=COS(A2); 
C1=B1*B2;  A3=X1;  A4=2.;  A5=X1;  A6=X2; 

B3=A5/A6;  C2=EXP(B3);  D1=A4*C2; 

E1=A3+D1;  F=C1/E1;  (2.2) 

The trivial statements "A1=X2", "A2=X1",
etc., represent the recognition by the
parser of a constant or variable. Some of
the  statements  in  (2.2),  such  as 
"E1=A3+D1",  can  be  carried  out  readily 
in  an  arithmetic  register,  while  a 
statement  such  as  C2=EXP(B3)  will 
require  a  macro  of  some  sort.  The 
important  point,  is,  however,  that  as 
long  as  one  is  working  with  floating 
point  numbers  with  mantissa  and  exponent 
of  a  fixed  maximum  number  of  significant 
digits,  there  will  be  an  upper  and  lower 
bound  on  the  time  needed  to  execute  each 
of  the  primitive  steps  in  (2.2). 

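Let us associate algebraic variables
$x_1$ with X1, $a_1$ with A1, etc.
Then Figure 1 gives a parse tree for the
function (2.1) in terms of these
algebraic variables. The purpose of
labeling the edges of the tree with
partials as shown becomes clear when we
note that

$$\frac{\partial f}{\partial x_2} = \frac{\partial f}{\partial c_1}\frac{\partial c_1}{\partial b_1}\frac{\partial b_1}{\partial a_1} + \frac{\partial f}{\partial e_1}\frac{\partial e_1}{\partial d_1}\frac{\partial d_1}{\partial c_2}\frac{\partial c_2}{\partial b_3}\frac{\partial b_3}{\partial a_6} \qquad (2.3)$$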
FIGURE 1

As is discussed in Sawyer (1984), this
derivative can be computed by simply
multiplying the partials found along the
edges from the root to the leaves of the
tree corresponding to $x_2$ and summing up
the products.

Now what is the payoff in the
"multiplying-down-the-tree" trick from
the  point  of  view  of  object  code?  Why 
do  we  not  simply,  in  effect,  choose  each 
leaf  in  left  to  right  sequence,  multiply 
up  the  tree,  and  add  to  accumulators  for 
the  partials?  The  answer  is  that  one 
must  be  careful  not  to  do  redundant 
multiplications.  The  strategy  of  doing 
multiplications  from  the  top  down  leads 
naturally  to  a  code  generation  scheme 
which does avoid redundant
multiplications.

Code to compute first partials of a
function being parsed such as (2.1) may
be generated as follows. When the
parser recognizes a node, code to
evaluate the partials of that node
with respect to its arguments is generated at
the same time as code to evaluate the
node. For instance, after the node C2
is recognized in the process of parsing
(2.1), the code "DC2DB3=C2" will be
generated as well as the statement
"C2=EXP(B3)" already seen in (2.2)
above. ("DC2DB3" is simply a reader
convenience representing a storage
location in which the partial of C2 with
respect to B3 is to be kept.) A
statement of the form "DFDB3 = DFDC2 *
DC2DB3" will also be generated and
pushed onto a special stack for such
statements. After an entire function
such as (2.1) is parsed and other object
code generated, the elements of this
stack will be popped, causing them to
enter the stream of object code in
reverse order. A portion of this code
popped from the stack for (2.1) would
look like this:


DFDD1 = DFDE1*DE1DD1; DFDA3 = DFDE1*DE1DA3;
DFDX1 = DFDX1+DFDA3; DFDC2 = DFDD1*DD1DC2;
DFDB3 = DFDC2*DC2DB3; DFDA6 = DFDB3*DB3DA6;
DFDX2 = DFDX2+DFDA6; DFDA5 = DFDB3*DB3DA5;
DFDX1 = DFDX1+DFDA5; ...                    (2.4)

Note that when (2.4) is actually
executed, according to the way
the object code for (2.1) has been
generated, every variable on the right
hand side of any statement in (2.4) is
already well defined. (Assume
accumulators have been zeroed.) Note
also that DFDB3 is computed only once,
and suffices to compute both DFDA6 and
DFDA5 in two more steps. Thus, while we
have, in effect, "multiplied down the
tree" to both the leaves A5 and A6, we
have not performed any redundant
multiplications.
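The forward and reverse sweeps described above can be written out by hand for the function (2.1). The Python sketch below is an illustration, not output of the author's compiler: variable names follow (2.2) and (2.4) in lower case, the local partials are coded explicitly where a compiler would generate them, and the evaluation point is an arbitrary assumption. The result is compared with central finite differences as a sanity check.

```python
import math

def f_and_partials(x1, x2):
    """Evaluate (2.1) and its two first partials by a forward and reverse sweep."""
    # Forward sweep: the primitive steps of (2.2), each with its local partials.
    a1 = x2
    b1 = a1 ** 2;         db1da1 = 2 * a1
    a2 = x1
    b2 = math.cos(a2);    db2da2 = -math.sin(a2)
    c1 = b1 * b2;         dc1db1, dc1db2 = b2, b1
    a3, a4, a5, a6 = x1, 2.0, x1, x2
    b3 = a5 / a6;         db3da5, db3da6 = 1 / a6, -a5 / a6 ** 2
    c2 = math.exp(b3);    dc2db3 = c2
    d1 = a4 * c2;         dd1dc2 = a4
    e1 = a3 + d1;         de1da3, de1dd1 = 1.0, 1.0
    f = c1 / e1;          dfdc1, dfde1 = 1 / e1, -c1 / e1 ** 2

    # Reverse sweep: statements in the order they would be popped from the stack.
    dfdx1 = dfdx2 = 0.0                       # accumulators zeroed
    dfdd1 = dfde1 * de1dd1
    dfda3 = dfde1 * de1da3; dfdx1 += dfda3
    dfdc2 = dfdd1 * dd1dc2
    dfdb3 = dfdc2 * dc2db3                    # computed once, used twice below
    dfda6 = dfdb3 * db3da6; dfdx2 += dfda6
    dfda5 = dfdb3 * db3da5; dfdx1 += dfda5
    dfdb2 = dfdc1 * dc1db2
    dfda2 = dfdb2 * db2da2; dfdx1 += dfda2
    dfdb1 = dfdc1 * dc1db1
    dfda1 = dfdb1 * db1da1; dfdx2 += dfda1
    return f, dfdx1, dfdx2

def f_only(x1, x2):
    """Direct evaluation of (2.1), used only for the finite-difference check."""
    return (x2 ** 2) * math.cos(x1) / (x1 + 2.0 * math.exp(x1 / x2))

x1, x2, h = 0.7, 1.3, 1e-6
f, g1, g2 = f_and_partials(x1, x2)
fd1 = (f_only(x1 + h, x2) - f_only(x1 - h, x2)) / (2 * h)
fd2 = (f_only(x1, x2 + h) - f_only(x1, x2 - h)) / (2 * h)
print(abs(g1 - fd1) < 1e-6, abs(g2 - fd2) < 1e-6)
```

Each reverse-sweep statement performs one multiplication or one addition, so the extra work is a fixed multiple of the number of edges in the parse tree.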

A little thought shows that the above
approach  yields  object  code  which 
performs  a  number  of  multiplications 
"down  the  tree"  which  is  less  than  the 
number  of  edges  in  the  tree.  (The  number 
of  such  multiplications  is  in  fact 
exactly  equal  to  the  number  of  edges  in 
the  tree  less  the  number  of  arguments  of 
the  root  node.)  Further,  in  the  case  of 
nodes  which  represent  binary  or  unary 
operations,  as  is  the  case  for  (2.1), 
the  time  taken  to  evaluate  all  the 
partials  along  the  edges  of  Figure  1 
must  be  bounded  by  a  time  proportional 
to  the  number  of  edges  in  the  tree. 
(There  must  be  some  maximum  time  which 
it  takes  to  compute  a  partial  of  any  of 
a  finite  set  of  built-in  unary  or  binary 
operators).  Finally,  the  time  it  takes 
to  add  the  partials  of  the  function  with 
respect  to  its  leaves  to  the  appropriate 
accumulator  is  certainly  bounded  by  a 
time  proportional  to  the  number  of 
leaves  in  the  tree.  Since,  as  we  have 
already  noted  above,  there  is  a  maximum 
time  which  it  takes  to  evaluate  any  of  a 
finite  set  of  unary  or  binary  built-in 
operators,  it  follows  that  the  time  it 
takes  simply  to  evaluate  a  function  such 
as  (2.1)  is  also  proportional  to  the 
number  of  edges  in  the  tree.  Hence, 
since  both  the  times  needed  to  evaluate 
the function and the time to evaluate all its
partials  are  proportional  to  the  number 
of  edges  in  the  parse  tree,  it  follows 
that  the  partials  of  such  functions  can 
be  evaluated  in  a  time  at  most  kt,  where 
t  is  the  time  in  which  the  function  is 
evaluated  and  k  depends  only  on  the  high 
level  language  and  machine  used. 

The  reasoning  above  can  also  be  used 
to  apply  the  kt  result  to  functions 
involving  sum  and  product  operators, 
though, as Sawyer (1984) points out, care
must  be  taken  so  that  computation  time 
for  product  operators  remains 
proportional  to  the  number  of  edges  in 
the  function.  The  reader  is  referred  to 
that  paper  for  details. 

It  should  be  noted  that  the  kt  bound
is  attained  for  one  relatively  simple 
function:  the  product  of  n  distinct 
variables.  The  efficient  way  to 
calculate  its  partial  with  respect  to  a 
given  variable  is  to  divide  the  product 
by  that  variable  (assuming  none  of  the  n 
variables  are  0  when  the  product  is 
evaluated).  Both  the  time  needed  to
calculate  the  product  and  the  time 
needed  to  calculate  the  partials  by 
division  will  be  proportional  to  n  for 
large  n.  Further,  k  should  roughly  be 
the  ratio  of  the  time  it  takes  to  do  a 
floating  point  division  to  that  which  it 
takes  to  do  a  floating  point 
multiplication.
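
A minimal sketch of the product example follows. The function name `product_and_partials` is hypothetical, and the code assumes, as the text does, that none of the variables is zero.

```python
def product_and_partials(xs):
    """Return the product of xs and its partial with respect to each x.

    Assumes no x is zero, as required for the division trick.
    """
    prod = 1.0
    for x in xs:                        # n multiplications
        prod *= x
    partials = [prod / x for x in xs]   # n divisions
    return prod, partials

prod, partials = product_and_partials([2.0, 3.0, 5.0])
```

Both loops run n times, so the ratio of the two running times is governed by the cost of a division relative to a multiplication.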

3.  Toward  Automatic  Generation  of
Efficient  Code  for  Second  Partials 

What  follows  is  a  brief  verbal 
description  of  how  automatic  generation 
of  efficient  code  for  second  partials  of 
a  user-programmed  function  might  be 
carried out. The attack is an extension of the paradigm above and of
Sawyer (1984). (An exhaustive treatment of this topic is a bit too
involved to be presented in a few Proceedings pages.)

We first note that there is a lower limit on how efficient computation
of second partials of a function can be, compared to the time it takes
simply to evaluate the function. Return to the example of the product
of n distinct variables discussed above. The second partial of the
product with respect to two of these n variables may be computed by
dividing the product of all the variables by the product of the two
variables in question. This means that, for a specified set of values
of the variables, n(n-3) operations are required to obtain the
partials. (This figure arises from the fact that two operations are
required to compute a partial with respect to each combination of
variables, and that the second partial of a product with respect to
the same variable taken twice is 0.) Since the time needed to compute
the product is proportional to n, it follows that the time needed to
compute the second partials as above should be proportional to the
square of the function evaluation time.

The reader should satisfy him- or herself that a bound proportional to
the square of product evaluation time cannot be improved. Remember,
each distinct variable is free in general to take on any value. In
particular, the respective values can be the first n primes. The
(n(n-1)/2)-n non-zero second partials evaluated for these values of
the variables will all be distinct numbers. Since the number of
distinct partials to be evaluated is proportional to the square of the
product evaluation time for large n, and since each of these partials
must involve at least one arithmetic operation apiece to produce it,
the proportional-to-time-squared bound cannot be beaten for products.
(We can, of course, do much worse. Each second partial of the product
with respect to two distinct variables is defined as the product of
the remaining n-2 variables. If we compute all non-zero partials in
this manner, we perform (n-3)n(n-1)/2 multiplications, so that our
bound on second partial computation time becomes proportional to
product computation time cubed.)

In principle, it should be possible to compute the second partials of
a function in a time proportional to the square of the evaluation time
for the function. Consider a function h which can be programmed as a
composite of built-in and user supplied functions (for which the user
also supplies first and second partials). Suppose we write

    h = h[g_1(f_1, \ldots, f_n), \ldots, g_m(f_1, \ldots, f_n)]    (3.1)

Now let us specify that, for a user program which evaluates h, a
compiler will recognize the g's as either built-in or user specified
functions with f's as arguments; that is, on recognizing the g's it
will immediately be able to produce code for the first and second
partials of the g's with respect to the f's. Formally, the f's must be
daughters of the g's in the parse tree with root h. On the other hand,
we make no restriction on how deep the g's may be in the parse tree.
We ultimately want second partials of h with respect to a certain set
of variables; each of the f's will be roots of parse trees which have
these variables and constants for leaves.

Now by a double application of the chain rule we get

    \frac{\partial^2 h}{\partial f_i \, \partial f_j}
      = \sum_{k=1}^{m} \sum_{l=1}^{m}
          \frac{\partial^2 h}{\partial g_k \, \partial g_l}
          \frac{\partial g_k}{\partial f_i}
          \frac{\partial g_l}{\partial f_j}
      + \sum_{k=1}^{m}
          \frac{\partial h}{\partial g_k}
          \frac{\partial^2 g_k}{\partial f_i \, \partial f_j}    (3.2)

There is actually quite a bit in the structure of (3.2) that we can
take advantage of for efficient code generation, with a bit of
massaging.

Note  that  there  is  a  recursive  structure 
to  (3.2):  given  that  we  have  the  first 
and  second  partials  of  h  with  respect  to 
the  g's,  and  that  we  can  generate  code 
to  compute  the  first  and  second  partials 
of  the  g's  with  respect  to  the  f's,  we 
do  indeed  get  the  partials  of  h  with 
respect  to  the  f’s.  Repeated 
application  of  (3.2)  will,  not 
surprisingly,  take  us  from  the  root  of
the  parse  tree  down  to  a  point  at  which 
the  partials  of  h  with  respect  to  the 
variables  of  interest  will  be  obtained. 
Yet  how  we  apply  (3.2)  in  such  a  way  as 
to  insure  that  the  partials  will  be 
obtained  in  a  time  proportional  to 
function  execution  time  squared  is  not 
immediately  apparent. 
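
One step of the recursion can be sketched numerically. The code below is an illustration only, assuming the standard second-order chain rule for a composite h(g_1(f), ..., g_m(f)): a double sum over pairs of g's plus a single sum over the g's. The function name `compose_second_partials` and the test function are hypothetical.

```python
def compose_second_partials(dh, d2h, dg, d2g):
    """Second partials of h(g_1..g_m) with respect to f_1..f_n.

    dh[k]        = dh/dg_k
    d2h[k][l]    = d2h/(dg_k dg_l)
    dg[k][i]     = dg_k/df_i
    d2g[k][i][j] = d2g_k/(df_i df_j)
    """
    m, n = len(dg), len(dg[0])
    out = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            total = 0.0
            for k in range(m):
                for l in range(m):              # double sum over pairs of g's
                    total += d2h[k][l] * dg[k][i] * dg[l][j]
                total += dh[k] * d2g[k][i][j]   # single sum over the g's
            out[i][j] = total
    return out

# h = g1 * g2 with g1 = f1 + f2 and g2 = f1 * f2, evaluated at f = (2, 3).
f1, f2 = 2.0, 3.0
g1, g2 = f1 + f2, f1 * f2
dh = [g2, g1]
d2h = [[0.0, 1.0], [1.0, 0.0]]
dg = [[1.0, 1.0], [f2, f1]]
d2g = [[[0.0, 0.0], [0.0, 0.0]], [[0.0, 1.0], [1.0, 0.0]]]
hessian = compose_second_partials(dh, d2h, dg, d2g)
```

Here h = f1^2 f2 + f1 f2^2, so the second partials can be checked by hand: 2 f2, 2 f1 + 2 f2, and 2 f1. The four nested loops make no attempt at the efficiency discussed in the text; they only show what each application of the rule must produce.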

The  trick  for  obtaining  the  kt  bound
on  first  partial  computation  is  to  keep
the  computation  time  for  those  partials
proportional  to  the  number  of  edges  in
the  parse  tree.  Similarly,  what  we  want
to  do  for  second  partials  is  to  keep  the
execution  time  for  second  partial 
computation  proportional  to  the  number 
of  edges  in  the  tree  squared.  One  way 
to  do  this  is  to  traverse  the  tree  in  an
outer  loop,  from  right  to  left,  stopping
at  each  leaf  to  traverse  the  tree  from
that  leaf  to  the  right.  If  this
procedure  computes  the  second  partials 
in  such  a  way  that  there  is  some  fixed, 
maximum  number  of  operations  of  maximum 
time  duration  for  each  edge  traversed, 
then  the  second  partials  will  be 
computed  in  a  time  proportional  to 
function  execution  time  squared. 

In  fact,  such  a  nesting  of  tree 
traversals  can  be  accomplished.  In 
practice  most  of  the  terms  in  (3.2)  drop 
out  as  one  moves  from  the  root  to  the 
leaves  of  the  parse  tree.  For  instance, 
consider  the  contribution  of  the 
variables  A1  and  A6  seen  in  (2.2)  and
the  parse  tree  of  Figure  1  to  an
accumulator  for  the  second  partial  of  F
with  respect  to  X1  and  X2.  In  this
instance  (3.2)  will  reduce  to  twice  the
second  partial  of  F  with  respect  to  C1
and  E1  times  the  product  of  first
partials  down  the  edges  from  C1  to  A1
times  the  product  of  the  first  partials
down  the  edges  from  E1  to  A6.  As
another  example,  to  get  the  contribution 
of  the  nodes  A5  and  A6  to  the 
appropriate  accumulator,  one  applies 
(3.2)  iteratively  to  F[E1(D1)], 
F[D1(C2)],  and  F[C2(B3)],  and  then 
multiplies  the  second  partial  of  F  with 
respect  to  B3  by  the  first  partials 
associated  with  the  arguments  of  B3. 
Finally,  the  second  partial  of  B3  with 
respect  to  A5  and  A6  is  multiplied  by 
the  product  of  first  partials  down  the 
edges  from  the  root  to  B3,  and  the 
result  added  to  the  foregoing  product. 
For  each  iteration  of  (3.2)  in  this
instance,  each  of  the  two  sums  on  the
right  hand  side  of  the  equation  will
involve  only  one  term.

In  general,  the  contribution  of  any 
two  leaves  in  a  parse  tree  to  an 
accumulator  for  second  partials  will  be 
computed  as  follows:  through  iterative 
application  of  (3.2),  compute  the  second 
partial  of  the  root  with  respect  to  that 
node  at  which  the  paths  up  the  tree  from 
the  leaves  in  question  join.  Each 
iteration  will  involve  only  single  terms 
for  each  of  the  two  sums  on  the  right 
hand  side  of  (3.2).  One  more  iteration 
is  done  to  obtain  the  second  partial  of 
this  node  with  respect  to  the  arguments 
on  the  paths  from  this  node  to  the 
leaves  in  question.  The  rest  of  the 
computation  is  then  simply  a  matter  of 
multiplying  this  second  partial  by  the 
products  of  the  first  partials 
associated  with  the  edges  connecting 
these  arguments  with  their  respective 
leaves.

Now  it  should  be  evident  that,  as  the
contributions  of  pairs  of  leaves  to
accumulators  for  second  partials  are
collectively  computed,  many
computations  do  not  have  to  be  repeated  for
every  pair.  Once  we  have  computed  (from
the  top  down)  the  first  and  second 
partials  of  the  root  with  respect  to  a 
node  in  the  tree,  or  the  second  partial 
of  the  root  with  respect  to  several 
arguments  of  a  node  within  the  tree,  we 
do  not  have  to  recompute  these  partials 
every  time  we  want  the  contribution  of 
pairs  of  leaves  which  have  paths  up  to 
these  arguments.  All  we  need  once  these 
things  are  computed  are  the  products  of 
the  first  partials  down  the  edges  from 
these  arguments  to  the  leaves.  But  the 
first  partial  computation  process  of 
Section  2  above  provides  the  product  of 
first  partials  down  the  edges  from  the 
root  to  these  arguments  and  from  the
root  to  the  leaves.  Thus  the  product
from  the  appropriate  argument  down  to  a 
leaf  can  be  obtained  by  a  single 
division.  If  we  now  require  that  the 
second  partials  of  the  operation 
represented  by  any  node  with  respect  to 
its  arguments  be  computable  in  time 
proportional  to  the  square  of  the  time
it  takes  to  evaluate  the  node,  then  all
the  foregoing  will  fit  together  to
produce  a  strategy  for  evaluation  of 
second  partials  for  which  the 
computation  time  is  indeed  proportional 
to  the  number  of  edges  in  the  parse  tree 
squared.  Binary  and  unary  operators 
will  meet  the  requirement,  as  will  sum 
and  product  operators,  if  the  latter  is 
properly  handled. 

Object  code  for  second  partials  is
generated  along  the  lines  that  code  for
first  partials  is  produced,  including
the  use  of  a  stack  of  code  to  be  output
in  reverse  order  after  the  root  is
recognized.  Each  time  a  node  in  the
parse  tree  is  recognized,  code  for
second  partials  with  respect  to  its
arguments  must  be  generated.  Code  from
the  stack  relating  to  second  partials
mirrors  the  behavior  of  code  in  that
stack  for  computation  of  first  partials,
although  (3.2)  must  now  be  appropriately
incorporated.  (Again,  the  exact
particulars  are  beyond  the  scope  of  this
paper.)  As  should  be  evident  from  the
foregoing  discussion,  considerable
backtracking  in  the  actual  execution  of
the  object  code  popped  from  the  stack
will  be  involved,  as  the  contribution  of
the  combination  of  each  leaf  and  every
leaf  to  its  right  to  some  accumulator  for
second  partials  must  be  computed.  This
would  appear  to  necessitate  a  system  of 
pointers  not  necessary  for  first  partial 
computation  alone. 

Problems  still  remain  with  the  second 
partial  computation  scheme  sketched 
above  for  which  space  does  not  allow  for
ample  discussion.  One  problem  is 
dangling  nodes  in  the  tree  which  are  not 
really  proper  leaves.  These  correspond 
to  variables  in  source  code  which  appear 
on  the  left  hand  side  of  an  assignment 
statement  once  and  on  the  right  hand 
side  of  a  statement  more  than  once. 

This  problem  is  solved  in  Sawyer (1984) 
for  the  first  partial  case,  and  the 
second  partial  case  is  a  generalization 
of  that  solution.  Space  complexity  is 
an  issue  even  with  first  partial  code 
generation,  and  will  be  even  more  so 
with  second  partials.  A  certain  amount 
of  recomputation  of  partials,  rather 
than  storing  them  indefinitely,  may  be 
necessary  for  some  functions  as  a  proper 
tradeoff  between  space  and  time  costs. 


As  stated  above,  the  author's  work  on
first  partials  was  motivated  by  a  desire 
to  streamline  weighted  least  squares 
analysis  of  categorical  data.  Some 
discussion  of  applications  of  the
strategy  of  Section  2  for  first  partial
code  generation  is  given  in  Sawyer
(1984),  and  to  some  extent  this  carries
over  to  second  partial  generation  as
well. 

To  briefly  carry  the  applications 
idea  further,  consider  the  case  in  which 
a  likelihood  is  to  be  maximized  over  a 
large  number  of  (say  nuisance) 
variables.  Even  if  the  second  partials 
matrix  is  not  used  in  the  search  routine 
itself,  that  matrix  may  be  preferable  as 
a  source  of  asymptotic  variances  of 
estimated  parameters,  or  even  as  a 
criterion  which  can  be  checked  for 
negative  definiteness  to  verify  that  one 
is  indeed  at  a  maximum.  Another 
application  may  be  to  biased  estimates 
of  a  function  of  a  large  number  of 
nuisance  parameters,  if  such  estimates 
are  constructed  from  a  set  of  reasonably 
consistent,  unbiased  estimates  of  these 
nuisance  parameters.  The  first  term  in 
a  Taylor  series  expansion  of  the  bias  in 
the  estimator  of  the  function  will 
involve  second  partials  of  the  function. 
Ability  to  compute  these  partials 
readily  may  allow  the  construction  of  a 
beneficial  bias  correction  term. 


NOISE  APPRECIATION:  ANALYZING  RESIDUALS  USING  RS/EXPLORE
David  A.  Burn  and  Fanny  L.  O’Brien,  BBN  Software  Products  Corporation 


The  RS/Explore  software  is  a  statistical  advisory  environment
for  performing  analysis  of  general  linear  models.  One
goal  of  data  analysis  is  to  find  a  "model"  that  adequately
describes  the  variation  in  the  data.  Residual  analysis  is  an
invaluable  tool  in  selecting  and  validating  a  model.  We  will
examine  how  RS/Explore  provides  convenient  access  to  traditional
and  innovative  graphical  displays  useful  in  residual
analysis.

KEY  WORDS:  Residual  Analysis,  Studentized  Residual,  Influence,
Leverage,  Cook's  Distance

1.  INTRODUCTION 

A  primary  objective  of  data  analysis  is  to  find  a  model  that 
adequately  describes  the  variation  in  the  data.  The  process 
of  building  models  consists  of  the  following  steps: 

1.  Model  Selection 

(a)  Determine  general  class  of  models. 

(b)  Identify  parsimonious  subclass  of  models. 

(c)  Apply  transformations  to  data. 

2.  Model  Examination 

(a)  Fit  model  to  data. 

(b)  Compute  estimates  of  parameters. 

3.  Model  Validation 

(a)  Check  model  assumptions. 

(b)  Check  model  fit. 

4.  Model  Implementation 

(a)  Predict  future  values  of  the  process. 

(b)  Control  future  values  of  the  process. 


Figure:  Approach  to  Model  Building  (flow  diagram  linking  Selection,
Examination,  Validation,  and  Implementation)


2.  RS/Explore  Software 

2.1  Objectives  of  the  Software 

The  RS/Explore  software  provides

•  an  interactive  computing  environment  for  data  analysis,
regression  modelling,  and  interpretation  of  results

•  a  menu  system  which  allows  the  program  to  be  used
effectively  by  nonstatisticians,  especially  industrial
scientists  and  engineers

•  statistical  tools  which  help  the  data  analyst  avoid  the
most  common  pitfalls  and  inappropriate  analyses  in
regression  modelling

2.2  Menu  System  for  Building  Models 

The  menu  system  in  RS/Explore  encompasses  the  iterative
approach  to  building  models.  In  particular,  the  activity  of
model  validation  is  simplified  by  the  ability  to  explore
residuals  through  a  variety  of  traditional  and  innovative  graphical
displays.

Figure:  Menu  System  for  Building  Models


2.3  Screen  Display  for  RS/Explore 

The  terminal  screen  in  RS/Explore  is  partitioned  into  three
regions:  graphics,  menu,  and  dialogue.  In  the  graphics
region,  RS/Explore  displays  graphical  objects  such  as  boxplots
and  scatterplots,  and  nongraphical  objects  such  as  AOV
tables  and  coefficients  tables.  In  the  menu  region,  RS/Explore
displays  the  list  of  currently  available  options,  and  highlights
one  or  more  of  these  options  as  appropriate  next  steps  in  the
data  analysis.  In  the  dialogue  region,  RS/Explore  displays
information  regarding  interpretation  of  statistical  procedures
and  echoes  keyboard  input.


Screen  Display  for  Residual  Analysis

Menu  region  of  the  residual  analysis  screen:

 1  Select  RESPONSE
 2  HISTOGRAM
 3  CASE  Order
 4  FITTED  Value
 5  Any  variable
 6  PROBABILITY  plot
 7  LAG  Serial  Graph
 8  PARTIAL  Regression
 9  INFLUENCE  Plot
10  DISPLAY  Data
11  Add  LOWESS
12  SCALE  Residuals
13  NEXT
14  MAIN

MULREG  FIT  REFINE  RESIDS  >


2.4  Formulas  for  Regression  Diagnostics 

The  RS/Explore  software  defines  formulas  for  regression
diagnostics  as  follows:

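Residual

    e_i = Y_i - \hat{Y}_i

Studentized  Residual

    t_i = \frac{e_i}{s_{(i)} (1 - h_i)^{1/2}}

Mean  Squared  Error

    s^2 = \frac{\sum_i e_i^2}{n - p}

Studentized  Mean  Squared  Error

    s_{(i)}^2 = \frac{(n - p) s^2 - e_i^2 / (1 - h_i)}{n - p - 1}
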
Cook’s  Distance 

Leverage  Point  Rule

    Any  observation  such  that  h_i > 2p/n

Influence  Point  Rule

    Any  observation  such  that  C_i > C_{.95}
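
The diagnostics named in this section can be sketched for the simplest case. The code below is not RS/Explore code; it assumes the usual textbook definitions of leverage, the deleted mean squared error, the studentized residual, and Cook's distance, and it is restricted to a straight-line fit (p = 2) so that the leverages have a closed form. The function name and the data are hypothetical.

```python
import math

def simple_regression_diagnostics(x, y):
    """Diagnostics for the straight-line fit y = b0 + b1*x (so p = 2)."""
    n, p = len(x), 2
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    b0 = ybar - b1 * xbar
    e = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]        # residuals
    h = [1.0 / n + (xi - xbar) ** 2 / sxx for xi in x]       # leverages
    s2 = sum(ei ** 2 for ei in e) / (n - p)                  # mean squared error
    rows = []
    for ei, hi in zip(e, h):
        # Mean squared error of the fit with this case deleted.
        s2_i = ((n - p) * s2 - ei ** 2 / (1.0 - hi)) / (n - p - 1)
        t_i = ei / math.sqrt(s2_i * (1.0 - hi))              # studentized residual
        c_i = ei ** 2 * hi / (p * s2 * (1.0 - hi) ** 2)      # Cook's distance
        rows.append({"residual": ei, "leverage": hi, "studentized": t_i,
                     "cook": c_i, "high_leverage": hi > 2.0 * p / n})
    return rows

# Hypothetical data: the last point lies far from the others in x.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 10.0]
y = [1.1, 1.9, 3.2, 3.9, 5.1, 9.0]
rows = simple_regression_diagnostics(x, y)
flags = [r["high_leverage"] for r in rows]
```

Only the last observation exceeds the 2p/n leverage cutoff. A useful sanity check on any implementation is that the leverages sum to p.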


3.  Data  Analysis  Example 

3.1  Description  of  Dataset 

The  Car  Dataset  consists  of  n  =  385  observations  on  8
characteristics  of  automobiles.


3.3  Least  Squares  Fit 

The  fitted  model  and  the  analysis  of  variance  table  are  as 
follows: 

Fitted  Model 

    \hat{Y} = 1.0175 + 0.1304 X_1 - 0.0403 X_2 + 0.5210 X_3


Analysis  of  Variance

                       Sum of      Mean
Source          df    Squares    Square   F-Ratio   p-value
Regression       3     870.42    290.14    536.40    0.0000
Residual       381     206.09      0.54
  Lack of fit  339     191.79      0.57      1.66    0.0232
  Pure Error    42      14.29      0.34
Total          384    1076.51

R-squared  =  0.8086
Adjusted  R-squared  =  0.8070
Standard  Error  =  0.7355
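
The summary statistics can be checked directly from the sums of squares and degrees of freedom in the analysis of variance table. The short sketch below is only a worked arithmetic check; the variable names are illustrative.

```python
# Sums of squares and degrees of freedom copied from the table above.
ss_reg, df_reg = 870.42, 3
ss_res, df_res = 206.09, 381
ss_lof, df_lof = 191.79, 339      # lack of fit
ss_pe, df_pe = 14.29, 42          # pure error

ms_reg = ss_reg / df_reg
ms_res = ss_res / df_res
f_regression = ms_reg / ms_res                        # F-ratio for the regression
f_lack_of_fit = (ss_lof / df_lof) / (ss_pe / df_pe)   # F-ratio for lack of fit
r_squared = ss_reg / (ss_reg + ss_res)
std_error = ms_res ** 0.5
```

The results agree with the reported F-ratios, R-squared, and standard error up to the rounding of the tabulated sums of squares.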


3.4  Analysis  of  Residuals 


The  residuals  from  the  fitted  model  are  examined  using  a 
variety  of  graphical  displays.  The  scale  of  the  residuals  in 
all  displays  may  be  specified  as  raw,  studentized  (default), 
absolute  raw,  or  absolute  studentized.  A  lowess  curve  may 
be  added  to  all  residual  displays  to  identify  trend. 


Car  Dataset  variables:

Name            Units           Scale
MPG             mi/gal          Measurement
CYLINDERS       4, 6, 8 cyl     Rank
DISPLACEMENT    cu in           Measurement
HORSEPOWER      hp              Measurement
WEIGHT          lbs             Measurement
SLUGGISHNESS    sec/(0.25mi)    Measurement
YEAR            1969-1981       Rank
ORIGIN          continent       Category

Figure:  GPM100  versus  WT100  (scatterplot;  points  and  fitted  lines
distinguished  by  CYLINDERS  =  4,  6,  8)


3.2  Identification  of  Model 

Our  objective  is  to  determine  the  relationship  between
gasoline  economy  (response)  and  weight  (predictor)  and  number
of  cylinders  (predictor).

Model

    Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \epsilon

Symbol    Variable
Y         GPM100  =  100/MPG
X_1       WT100  =  WEIGHT/100
X_2       CYLINDERS,  (C=6)-(C=4)
X_3       CYLINDERS,  (C=8)-(C=4)
Figures  (residual  displays  for  the  fitted  model  of  GPM100):

Histogram  of  Residuals,  using  RAW  residuals
Residuals  vs  Fitted  Values  of  GPM100,  using  STANDARDIZED  residuals
Case  Order  Graph  of  Residuals,  using  STANDARDIZED  residuals,  with  (0.5)  smoothed  residuals
Residuals  vs  Fitted  Values  of  GPM100,  using  ABSOLUTE  STUDENTIZED  residuals,  with  (0.5)  smoothed  residuals
Residuals  vs  Predictor  Values  (WT100,  by  CYLINDERS  =  4,  6,  8),  using  STANDARDIZED  residuals
Residuals  vs  YEAR  Values,  using  STANDARDIZED  residuals
Residuals  vs  Predictor  Values  (CYLINDERS),  using  STANDARDIZED  residuals:

    CYLINDERS          4           6           8
    # Pts            199          83         103
    Mean           2e-16       2e-16       4e-16
    IQR         1.086974    1.249066    1.567577

Normal  Probability  Plot  of  Residuals,  using  STANDARDIZED  residuals
Lag-1  Serial  Graph  of  Residuals,  using  STANDARDIZED  residuals,  with  (0.5)  smoothed  residuals
Influence  Plot  of  Residuals  of  GPM100,  using  STUDENTIZED  residuals  (horizontal  axis:  element  of  hat  matrix  diagonal;  reference  lines  at  the  95%  point  of  Cook's  Distance  and  at  the  leverage  point  cutoff  2p/n  =  0.0208)
Partial  Regression  Plot  of  GPM100  for  WT100  (Coef  =  0.1304)
Influence  Plot  of  Residuals  of  GPM100,  using  ABSOLUTE  STUDENTIZED  residuals
4.  SUMMARY

•  provide  lowess  smooth  on  all  scatterplots

•  give  easy  access  to  regression  diagnostics


ABSTRACT

This paper describes the application of expert system technology
to the development of a software tool for processing and analysis
of time series signals. The system integrates distinct numerical
and symbolic processing cores to form an analysis environment
where numeric and symbolic processing tasks are performed as
needed during the analysis. Sophisticated off-the-shelf numerical
analysis software is coupled with a high-end expert system development
shell to form the integrated system. A knowledge base of
rules for performing ARIMA (AutoRegressive Integrated Moving
Average) time series modeling is implemented in the system prototype.
The user interface is presented in a multi-window environment
on a workstation with bit-mapped graphics.

1.  INTRODUCTION

This paper describes a prototype expert system for signal
processing and data analysis, called COSTAR (Coordinated Statistical
Analysis and Reasoning). As illustrated in Fig. 1, the philosophy
behind the system design is to integrate four key functional
components of the data analysis process: Users, Graphics, Symbolic
Rules, and Numerics. The system can be used as a "black
box", but it allows intervention of the user at selected points in the
analysis. Its objective is to serve as an example of how these functional
areas can be integrated for complex signal analysis, as well
as to serve as a test bed for studying issues of strategy development
and rule refinement. General references for current work in the
area of artificial intelligence and expert systems in statistics are
Gale (1986b) and Haux (1986). Discussions of other specific systems
can be found there and in Gale and Pregibon (1984), Milios
(1984), Nelder (1986), and Nii and Feigenbaum (1978).


2.  STATISTICAL PROBLEM ADDRESSED

One of the objectives of this work is to study and formalize
analysis strategies for statistical signal processing and modeling.
This was also one motivation in the development of pioneering
systems such as DINDE (Oldford and Peters, 1988) and DARi'
(Dmioho. 1984). In the development of COSTAR, inferences
about general statistical strategies for signal analysis and modeling
are developed by studying a particular modeling problem with
widespread applications: ARIMA modeling for a class of nonstationary
univariate time series. These techniques begin (after outlier
identification, editing, detrending, and variance stabilizing
transformations) with linear differencing operations to transform a
nonstationary series to a stationary one. This is followed by an analysis
of the autocorrelation (ACF) and partial autocorrelation (PACF)
functions to identify patterns that are characteristic of different
model structures and orders. Model parameters are then fit to the
series, using an iterative maximum likelihood estimation scheme in
the general ARIMA case. This is followed by residual analysis to
score or rank the goodness of the fitted model for comparison with
other candidate models. For detailed descriptions of the ARIMA
modeling techniques applied in this work, see Box and Jenkins
(1976) or Brockwell and Davis (1987). Section 6 of this paper
presents an example showing several of these modeling stages.
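
The first two stages described above, differencing and inspection of the sample autocorrelation function, can be sketched in a few lines. This is a generic illustration, not COSTAR or IMSL code; the function names and the synthetic series are hypothetical.

```python
def difference(series, lag=1):
    """Apply one differencing operation: y[t] = x[t] - x[t - lag]."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]

def sample_acf(series, max_lag):
    """Sample autocorrelation function for lags 0..max_lag."""
    n = len(series)
    mean = sum(series) / n
    dev = [x - mean for x in series]
    c0 = sum(d * d for d in dev) / n
    return [sum(dev[t] * dev[t + k] for t in range(n - k)) / n / c0
            for k in range(max_lag + 1)]

# A synthetic series: a linear trend plus an alternating component.
raw = [2.0 * t + (1.0 if t % 2 == 0 else -1.0) for t in range(20)]
diffed = difference(raw)
acf_raw = sample_acf(raw, 3)
acf_diffed = sample_acf(diffed, 3)
```

The trend dominates the ACF of the raw series, which stays large and positive at low lags, while the differenced series shows the strongly negative lag-1 autocorrelation of the alternating component. Patterns of this kind are what the identification rules would examine.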

While the emphasis here is on ARIMA model fitting as the
objective of the signal analysis, such models are widely used for
forecasting, and can also be combined with intervention analysis to
detect discrete changes in parameters of a model caused by exogenous
events.

Figure 2.  Hardware and Software Integration (Alliant: IMSL, FORTRAN;
Symbolics: KEE, LISP; linked by TCP/IP)


3.  SYSTEM ARCHITECTURE

Effective automated signal processing and data analysis requires
the integration of symbolic and numeric processing. In this
system, the symbolic processing burden is carried by the rule-based
expert system and, in a limited and structured way, by the user.
The numeric processing is handled by a modern "number crunching"
system. A schematic of the system architecture is shown in
Fig. 2. A mini-supercomputer, the Alliant FX/8 parallel processor
running under Concentrix 3.0, performs the numeric processing
using a collection of 15-20 routines from the Fortran-based IMSL
9.2 library (IMSL, 1984). The computational requirements for
univariate time series modeling do not really demand a supercomputer,
but this architecture will make it easier to address more
computationally intensive signal analysis within this system framework
in the future. Symbolic processing is handled by a Symbolics
3640 running under Genera 7.1 and using the expert system development
shell KEE (Knowledge Engineering Environment) 3.1
(IntelliCorp, 1987) and Common Lisp. Information is exchanged
between numeric and symbolic processing entities over a network
operating under the protocols of TCP/IP (Transmission Control
Protocol/Internet Protocol). The user interface is provided
through the Symbolics workstation, which displays bit-mapped
graphics in a multi-window environment. User inputs are provided
through a keyboard or a 3-button mouse.

The system reflects an integration of distinct hardware and
software functions. The functional outputs are integrated, but the
numeric and symbolic elements remain distinct, passing information
across the network to each other. Advantages of such a configuration
are that each type of hardware and software that is most
appropriate for an aspect of the task is applied, and existing
numerical signal and data analysis software can be used directly. A
disadvantage is the need for the developer to work in and integrate
two different hardware and software environments; this has not
proved to be a major problem, however. While this (Fig. 2) is a
"high end" hardware configuration, it is now possible to build a
more economical system with similar functionality using KEE and
IMSL on an 80386 PC system running under a version of UNIX.

4.  THE  KNOWLEDGE  BASE 

The knowledge base (KB) uses structures that are becoming
common in sophisticated expert system software, due to their generality
and power. These are frames and object oriented programming
techniques. The knowledge base is composed of three major
segments: data objects, production rules, and graphical objects.
The KB structure is defined so as to reflect the natural structure of
the modeling problem addressed by the analysis.

The basic objects in the system are data sets existing at a particular
stage of the analysis. For example, the knowledge base may
initially contain a single "raw" data set object. The operation of
data editing will create an edited data set; the operation of
transforming, a transformed data set. If alternative transforms are
entertained, for example, multiple transformed data sets
corresponding to a single edited data set may be part of the knowledge
base. Figure 3 shows a fragment of a hierarchical tree showing
some of the types of objects in the knowledge base (Figs. 3-8
are displayed at the end of this article).

Data objects are represented by frames, which hold not only
the time series of data, but a variety of additional information
about a particular kind of data object. For example, a data object
in the residual class has a frame with slots that hold information
about the summary statistics of the residuals, the model used to
derive the residuals, scores for goodness of fit tests, etc. Figure 4
shows a portion of the residual frame attributes for this system's
KB. Seven classes of data objects are defined in the current system.
Data objects are defined as members of prototype classes and
inherit attributes appropriate to that particular class. The notion of
prototype was defined in Aikins (1983) and has also been applied
in the REX system for regression analysis (Gale 1986a). In addition
to helping formalize and streamline the structure of the expert
system KB, the idea is useful in strategy formalization, since it
forces the statistician to think more abstractly about:

•  The types of objects that form the analysis task

•  When they are needed

•  How they generalize to or specialize from other classes of
objects
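
The frame-and-prototype organization described above can be illustrated
with a short Python sketch. It is not the system's implementation; the
class name `Frame`, the slot names, and the sample values are all
hypothetical, chosen only to show how a residual data object might
inherit slots from its prototype class.

```python
class Frame:
    """A frame: a named set of slots that inherits from a prototype frame."""

    def __init__(self, name, prototype=None, **slots):
        self.name = name
        self.prototype = prototype      # parent class frame, or None at the root
        self.slots = dict(slots)

    def get(self, slot):
        """Return a slot value, searching up the prototype chain."""
        frame = self
        while frame is not None:
            if slot in frame.slots:
                return frame.slots[slot]
            frame = frame.prototype
        raise KeyError(slot)


# Hypothetical prototype classes (the paper defines seven; two are sketched).
data_object = Frame("data-object", series=None, sampling_interval=1.0)
residual_class = Frame("residual", prototype=data_object,
                       model=None, summary_statistics=None, fit_scores=None)

# A specific residual data object created at one stage of an analysis.
residuals_1 = Frame("residuals-1", prototype=residual_class,
                    series=[0.2, -0.1, 0.4], model="ARIMA(1,1,0)")

print(residuals_1.get("model"))              # slot set on the instance
print(residuals_1.get("sampling_interval"))  # inherited from data-object
```

Slots that the instance does not set itself, such as the sampling
interval here, are found by walking up to the class frames, which is the
sense in which an object "inherits attributes appropriate to that
particular class".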

The primary repository of signal analysis expertise is contained
in the production rule system segment of the knowledge
base. This is a set of if-then rules that perform the symbolic
reasoning in the system, as well as controlling the numeric processing.
Approximately [number illegible] rules are implemented in the first
generation COSTAR system, but the knowledge base is undergoing revision
and the number of rules in the next generation system is expected
to at least double. The rules are generally invoked in a forward
chaining process, consistent with the "data driven" nature of a
complex signal analysis. The rules in the KB are partitioned into
rule classes that correspond approximately to the way that data
objects are divided into classes. For example, a class of "residual
rules" that are appropriate only for application to residual data
objects is defined. Rules are defined in a hierarchy, as shown in
Fig. 5. Rule classes are divided into subclasses, and generic
subclasses contain specific instances of particular rules (e.g., the
Box-Ljung Scoring Rule in the Residual Rules Class in the
ARIMA Modeling Rules Super Class shown in Fig. 5).

This hierarchical, categorical structuring of rules in the knowledge
base serves several purposes. It allows the structured development
of the KB: portions of the overall analysis can be developed
independently. It also permits a more organized debugging of the
knowledge base at an intermediate level. Instead of evaluating
each rule individually, or evaluating the final system conclusions
based on firing all the rules, reasoning can be invoked using a
particular rule subclass. This allows for:

•  Analysis of the integrated performance of a subset of
rules

•  Scaling down the complexity of problem diagnosis
significantly
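
As a rough illustration of invoking reasoning on a single rule subclass,
the sketch below runs a simple forward chainer over only those rules
whose class path matches a requested prefix. The rule names, the class
paths, the fact names, and the 0.05 threshold are hypothetical and are
not taken from the system's knowledge base.

```python
class Rule:
    """An if-then rule tagged with its position in the rule hierarchy."""

    def __init__(self, name, rule_class, condition, action):
        self.name = name
        self.rule_class = rule_class    # path, e.g. ("arima-modeling", "residual")
        self.condition = condition      # facts -> bool
        self.action = action            # facts -> None (adds new facts)


def forward_chain(facts, rules, rule_class=()):
    """Fire each applicable rule once, repeating until nothing new fires.

    Only rules whose class path starts with `rule_class` take part.
    """
    active = [r for r in rules if r.rule_class[:len(rule_class)] == rule_class]
    fired = []
    progress = True
    while progress:
        progress = False
        for rule in active:
            if rule.name not in fired and rule.condition(facts):
                rule.action(facts)
                fired.append(rule.name)
                progress = True
    return fired


RULES = [
    Rule("goodness-score", ("arima-modeling", "residual"),
         lambda f: "whiteness" in f,
         lambda f: f.update(goodness="POOR" if f["whiteness"] == "fail" else "GOOD")),
    Rule("box-ljung-scoring", ("arima-modeling", "residual"),
         lambda f: "box_ljung_p" in f,
         lambda f: f.update(whiteness="fail" if f["box_ljung_p"] < 0.05 else "pass")),
    Rule("variance-transform", ("arima-modeling", "transform"),
         lambda f: f.get("variance_trend", False),
         lambda f: f.update(transform="LOG")),
]

facts = {"box_ljung_p": 0.01, "variance_trend": True}
fired = forward_chain(facts, RULES, rule_class=("arima-modeling", "residual"))
print(fired, facts["goodness"])
```

Restricting the chainer to the residual subclass leaves the transform
rule untouched even though its condition holds, so the behavior of that
subset of rules can be examined in isolation.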

In addition to the logical hierarchy reflected in Fig. 5, the
rules in the KB also reflect a conceptual hierarchy that is derived
from an idea of the generic signal analysis process. The rules can
be labeled in one of three ways:

•  Strategic 

•  Tactical

•  Mechanical

These labels reflect the ways in which knowledge seems to be
provided by experts during the "knowledge engineering", i.e.,
expertise extraction, phase of KB development. For example, a strategic
rule may describe at what phase of the analysis checks for
nonstationarity are appropriate. An associated tactical rule could
identify ways to test for stationarity, such as examination of patterns in
ACFs and PACFs. Mechanical rules would have heuristics for
identifying patterns in computed correlation functions, as well as
triggers for invoking those numerical procedures needed to
compute correlations if they had not already been computed. Such a
conceptual rule hierarchy is clearly useful for identifying and
formalizing statistical analysis strategies.

The knowledge base also contains graphical objects and rules
for controlling them. Rules for the creation and display of different
types of plots, tabular summaries, and icons or pushbuttons are
contained in the knowledge base as well. The object-oriented
structure of the system allows procedures for generating and
displaying graphic objects to be linked directly to the data objects
themselves as frame attributes. This helps to

•  Organize the KB

•  Remind the knowledge engineer that graphical displays
play a role on a par with numeric data in signal analysis
problems

5.  SYSTEM  CONTROL  STRATEGIES

Many of the difficult problems in developing an effective software
tool for computer guided data analysis seem to arise in the
area of system control. This is, after all, the area in which a data
analyst shows most of his reasoning expertise: what numerical
computation should be performed at what stage, and what should be
done next as a result of the numerical computations performed so
far? While expertise in interpreting the results of a particular
numerical procedure is often needed, the need for expert control of
the overall analysis is paramount. Control strategies in this software
system are currently being refined. This section outlines the
planned control strategies.

As mentioned in the introduction, the user plays a role in
controlling the system through a limited number of interaction
opportunities presented by the system. These inputs take the form of
either information supplied by the user at initial prompts, or
mouse indicated confirmation or veto of options selected by the
system at various stages of the analysis. For example, Fig. 6 shows
a USER OPTIONS icon in the CURRENT ACTIVITY window
with four types of variance stabilizing transforms that are available
in the system. The transform highlighted in black indicates the
system's recommendation for a transformation, based on the rules in
the KB. The user may, however, use the mouse to highlight
another choice of transform and override the system recommendation.
The system control structure invokes a break in the forward
chaining agenda at times that are appropriate for such a user
selection.

The system must have strategies and control structures for
generating, maintaining, and ranking multiple alternative models
for the time series data. At each stage in the analysis (where a
stage is defined as the generation of a particular instance of a
prototypical object, e.g., a specific set of edited data), one or more
alternative objects (parts of a candidate modeling hypothesis) are
generated. These objects are scored according to a "certainty" in
the facts or rule conclusions that led to their creation. For example,
an edited data set created by deleting a point that has magnitude
larger than five times the interquartile range of the raw
sample may have a very high certainty score. An edited data set
created by deleting that point, plus another point with magnitude
exceeding twice the interquartile range, may be excluding a
marginal outlier, and so has a lower certainty score. The ranked
candidates are placed on a stack, and the item on the top of the stack is
removed and used for the next stage of the analysis. This stack is
constructed as progress is made through each stage of the analysis.
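
The outlier example above can be sketched in a few lines of Python. The
certainty values (0.9 and 0.6), the use of distance from the sample
median as the "magnitude" of a point, and the function names are
assumptions made for illustration; the paper does not specify how the
scores are computed.

```python
import statistics


def interquartile_range(values):
    q1, _, q3 = statistics.quantiles(values, n=4)
    return q3 - q1


def ranked_edit_stack(raw, thresholds=((2.0, 0.6), (5.0, 0.9))):
    """Build a stack of (certainty, edited_series) candidates.

    Each threshold is (multiple of the IQR, heuristic certainty score).
    The list is sorted so that pop() returns the most certain candidate.
    """
    spread = interquartile_range(raw)
    center = statistics.median(raw)
    stack = []
    for multiple, certainty in thresholds:
        edited = [x for x in raw if abs(x - center) <= multiple * spread]
        if len(edited) < len(raw):          # keep only edits that delete something
            stack.append((certainty, edited))
    stack.sort(key=lambda candidate: candidate[0])
    return stack


raw = [3, 1, 4, 100, 5, 2, 8, 40, 6, 7]
stack = ranked_edit_stack(raw)
certainty, edited = stack.pop()             # top of the stack: highest certainty
print(certainty, edited)
```

The candidate that deletes only the extreme point is taken first; the
more aggressive edit, which also drops the marginal outlier, stays on
the stack for use if a later stage loops back.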

When a single candidate model has been fitted and scored,
other candidate models at that stage are also fitted and ranked.
Depending on the quality of the fitted models, a loop back to an
earlier stage of analysis may be required. For example, if candidate
ARIMA (p,1,0) models for several choices of AR order p do
not produce residuals with the desired whiteness properties, then a
loop back to a model structure selection phase may be needed to
consider ARIMA (p,1,q) models. If fits still seem inadequate, a
deeper loop back to a model differencing order stage may be
required, where the stack containing candidate differencing orders is
used to obtain the next candidate differencing order. Forward
chaining toward model structure selection and fitting is then
resumed.
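
A minimal sketch of this loop-back control, assuming the simplest
one-level-at-a-time strategy, is given below. The function
`residuals_white` is a hypothetical stand-in for fitting an
ARIMA (p,d,q) model and testing its residuals; no model is actually
fitted here.

```python
def search_models(differencing_orders, ma_orders, ar_orders, residuals_white):
    """Walk ranked candidates stage by stage, looping back when a stage fails.

    Candidates are listed best-first; `residuals_white(p, d, q)` stands in for
    fitting ARIMA(p, d, q) and testing the residuals for whiteness.
    """
    tried = []
    d_stack = list(reversed(differencing_orders))    # best candidate on top
    while d_stack:
        d = d_stack.pop()
        q_stack = list(reversed(ma_orders))
        while q_stack:
            q = q_stack.pop()
            for p in ar_orders:
                tried.append((p, d, q))
                if residuals_white(p, d, q):
                    return (p, d, q), tried
            # Every AR order failed: loop back to structure selection (next q).
        # Every structure failed: deeper loop back to the differencing stage.
    return None, tried


def toy_whiteness_check(p, d, q):
    # Hypothetical stand-in: pretend only second differencing with p >= 2 works.
    return d == 2 and p >= 2


model, tried = search_models([1, 2], [0, 1], [1, 2], toy_whiteness_check)
print(model, len(tried))
```

The list `tried` records the order in which candidates were visited:
the pure AR structures with first differencing, then the structures with
a moving average term, and only then the second differencing order.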

Several important issues in the above scheme need to be
addressed. The first is in the use of the term "certainty". This is not
to be taken in a formal probabilistic sense, or even necessarily in
the sense of a belief function formalization (Kanal et al. 1986).
Here we mean only some heuristic scoring which ranks choices at
a particular stage, either by strength of a numerical result (e.g.,
observed significance level) or lack of other viable alternatives.
Schemes for combining these uncertainties across stages may
require inclusion of a more elaborate calculus of uncertainty,
however. This is particularly true when multiple candidate models,
derived from multiple loop backs of varying depths, need to be
finally ranked for presentation to the user.

Another issue is the determination of the depth of a loop
back from a particular stage. The simplest heuristic requires a
loop back of one level to the previous stage, where locally ranked
alternatives are then sorted. However, rankings of alternatives at
an earlier stage may now need to be updated, conditional on what
has been observed at the later stages. For example, if fits of
a collection of ARIMA (p,1,0) models show residuals
with distinct changes in mean level over the series (a remaining
nonstationarity), an expert may rank an alternative ARIMA
(p,1,q) structure very low, and wish to loop back two levels to
consider ARIMA (p,2,0) structures. By using a second order
differencing operator, instead of the original first order difference,
the expert hopes to remove the observed residual nonstationarity.
This may call for the development of much more elaborate
schemes for handling uncertainty if true expert analysis strategies
are to be studied and emulated. This is an area of ongoing
research.

6.  AN  EXAMPLE  OF  THE  SYSTEM 

Figures 6 through 8 show displays for several of the latter
stages of a typical signal analysis session. The aim is to show what
the user interface looks like and what types of data are displayed.
Figure 6 shows a stage where edited data are being examined to
determine if a variance stabilizing transform is required. The
edited data (from an actual ARIMA (2,1,0) model) are displayed
along with a table of summary statistics. A possible first-order
nonstationarity in the data can be seen in the plot. The CURRENT
ACTIVITY window indicates that no variance-stabilizing transform
is recommended. An icon appears in the left portion of the
window and presents options which the user can select by clicking on
the left button of the three button mouse. Two other windows are
visible in the display. The LISP LISTENER window provides a
deeper level of access to the system than the average user will
employ. The ANALYSIS SCRIPT window provides a trace of the
steps in the analysis that have been performed so far. Such a trace
will be used in the future for formalized rule refinement based on
deductions from application specific data analysis sessions.

Figure 7 shows an analysis of the transformed data to check
for nonstationarities. A simple set of pattern recognition rules
"examine" the ACF and PACF, classify the correlation patterns into
one of five characteristic classes, and display the results in
mouse-sensitive icons. Plots of the ACF and PACF are also displayed for
the user to view and analyze.

Figure 8 shows a display of the full COSTAR system screen
after a model has been fit and evaluated. The CURRENT ACTIVITY
window has five new icons which display information about
the fitted ARIMA (1,1,1) model (an incorrect model structure was
fit for illustration). These icons are for display only, and are not
subject to mouse control. These icons display the fitted model
parameters, a composite model goodness score (POOR), and the
(normalized) numerical results of a Box-Ljung diagnostic test on
the residuals. Another plot has been displayed in the INTEGRATED
PSD window, the integrated power spectral density of
the residuals. The linearity of this curve reflects the degree of
whiteness of the residuals. Various windows and displays appear
and disappear as appropriate in fixed locations on the screen as
the analysis progresses. This is aimed at avoiding the clutter and
confusion that can result from too much flexibility in window
generation and rearrangement.
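
For readers unfamiliar with the Box-Ljung diagnostic mentioned above,
the following Python sketch computes the statistic in its standard form,
Q = n(n+2) * sum over k of r_k^2 / (n-k). This is the textbook
definition, not code from the system, and the paper does not say how
its "normalized" result is derived from Q; the function names and the
sample residuals are illustrative assumptions.

```python
def autocorrelations(x, max_lag):
    """Sample autocorrelations r_1 .. r_max_lag of the series x."""
    n = len(x)
    mean = sum(x) / n
    dev = [v - mean for v in x]
    c0 = sum(d * d for d in dev)
    return [sum(dev[t] * dev[t - k] for t in range(k, n)) / c0
            for k in range(1, max_lag + 1)]


def ljung_box(x, max_lag):
    """Ljung-Box statistic Q = n (n + 2) * sum_k r_k**2 / (n - k)."""
    n = len(x)
    r = autocorrelations(x, max_lag)
    return n * (n + 2) * sum(rk * rk / (n - k) for k, rk in enumerate(r, start=1))


# Strongly alternating "residuals" are far from white noise.
residuals = [1.0, -1.0] * 10
q = ljung_box(residuals, max_lag=1)
print(round(q, 3))
```

For white residuals Q is approximately chi-square distributed, so a
value far above the upper 5% point for one degree of freedom (about
3.84) would contribute to a poor goodness score.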

7.  SUMMARY  AND  FUTURE  WORK 

This paper has described the implementation of COSTAR, a
prototype expert system signal analysis and data processing tool
now under development at TASC. Descriptions of the system
architecture and knowledge base, and examples of the operation of
the current system, have been given. Current work focuses on
enhancement of the system knowledge base and elaboration of the
modeling process control structure to handle the complexities of
multiple, multi-stage loop backs. Future work will focus on the use
of analysis traces for automated rule refinement in fielded systems,
the application and development of more formalized validation
procedures for production rule systems, and the "self validation"
of a system, e.g., procedures that allow the system to tell when it
may be faced with a cyclostationary process, a nonstationary
process which cannot be adequately described by the available
model structures.

Figure 3. Object Hierarchy
Figure 5. Analysis Rule Hierarchy

[Figure 6, CURRENT ACTIVITY window: Data type is new TRANSFORMED; Transform recommended is NONE; L-Mouse OPTIONS to CHANGE]

[Figure 7, CURRENT ACTIVITY window: Data type is TRANSFORMED; Note CORRELATION plots; Correlation PATTERNS are shown; L-Mouse OPTIONS to CHANGE]

[Figure 8, CURRENT ACTIVITY window: Data type is RESIDUALS; Residuals being CHECKED; Note model GOODNESS score; Note DIAGNOSTIC PLOTS]

We describe the design of PiTSSa, a computer system for interactive time series and spectral analysis.
This system is written in Lisp, a language which has long been a favorite of researchers in computer
science but which has not been used extensively for data analysis. Some of its interesting features include
the use of object-oriented programming to break up PiTSSa into a large number of separate modules;
a systematic way of defining interactions between the user and PiTSSa via a graphical input device
(“mouse”); and an implementation of a number of enhancements to a standard time series package
which leads to a qualitative improvement in interactive data analysis for time series.


1.  INTRODUCTION 

We describe in this paper PiTSSa, a computer system which supports
interactive time series and spectral analysis and which is written in
the language Lisp. Our system is part of an effort by a small number
of investigators in recent years to implement some of the ideas put
forth in a series of articles by McDonald and Pedersen (1985a, 1985b,
1988). The basic thesis of their work is that interactive data
analysis is best supported by hardware and software originally
designed by computer scientists to efficiently support experimental
programming. These include, on the hardware side, a modern computer
workstation with high-speed and high-resolution bitmap graphics, and,
on the software side, a computing environment such as is provided by a
modern programming environment for, say, the language Lisp.

There are several research questions we are currently investigating
with PiTSSa, among which we will be concerned with three in this
paper. First, how can data analysts best take advantage of the
opportunities afforded by the hardware and software supplied on modern
computer workstations? Most (but by no means all) of the widely used
software systems for interactive data analysis are packages which were
originally designed on a batch-processing system. These typically make
limited use of the capabilities of modern workstations (other than
superficial use of menus to replace typing of certain commands). A few
systems (such as S (Becker and Chambers 1984)) were originally
developed in an interactive computing environment (such as UNIX).
These are a vast improvement over batch-oriented systems, but it is
necessary to be somewhat of a computer expert to augment them to
handle graphical interaction with a user.


Second, is it possible to design an interactive data analysis system
which is accessible by, and useful for, users of many different levels
of sophistication? Typically, “user friendly” systems can drive an
expert user to distraction with their bulky use of menus, while terse
systems for an expert are incomprehensible to novices. Two things are
desirable here: a system with good support for novices but which can
easily be “opened up” for fundamental modifications by an expert; and
a system which grows in usefulness as novices learn more and more
about both it and interactive data analysis and become experts
themselves.

Third, what new forms of interactive time series and spectral analysis
are possible with modern workstations? Time series analysis is a field
which has been particularly influenced by available computing power.
Much of the emphasis on lag window (or Blackman-Tukey) spectral
estimators in the 1950’s was due to computational issues: the lag
window approach allowed spectral estimates to be calculated from only
a small number of sample autocovariance function values. With the
advent of the fast Fourier transform and more powerful computers in
the 1960’s, it became possible to use spectral estimation techniques
with a greater computational overhead. The computationally intensive
multiple taper approach to spectral analysis (Thomson 1982) would have
been a theoretical curiosity if it had been introduced 15 years ago.
The advent of computer workstations opens up new avenues for
qualitatively improving interactive time series analysis.
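
To make the computational point about lag window estimators concrete,
here is a small Python sketch of a Blackman-Tukey style estimate that
needs only the first few sample autocovariances. The choice of a
Bartlett (triangular) lag window, the unit sampling interval, and the
function names are assumptions for illustration and are not taken from
PiTSSa.

```python
import math


def sample_autocovariances(x, max_lag):
    """Biased sample autocovariances s_0 .. s_max_lag."""
    n = len(x)
    mean = sum(x) / n
    dev = [v - mean for v in x]
    return [sum(dev[t] * dev[t + k] for t in range(n - k)) / n
            for k in range(max_lag + 1)]


def lag_window_sdf(x, max_lag):
    """Return a function f -> lag window sdf estimate (unit sampling interval).

    Uses a Bartlett (triangular) lag window w_k = 1 - k / (max_lag + 1).
    """
    s = sample_autocovariances(x, max_lag)
    w = [1.0 - k / (max_lag + 1) for k in range(max_lag + 1)]

    def sdf(f):
        return s[0] + 2.0 * sum(w[k] * s[k] * math.cos(2.0 * math.pi * f * k)
                                for k in range(1, max_lag + 1))
    return sdf


series = [1.0, -1.0] * 16          # oscillates at the Nyquist frequency 0.5
estimate = lag_window_sdf(series, max_lag=4)
print(estimate(0.0), estimate(0.5))
```

Only five autocovariance values enter the estimate, regardless of the
length of the series, which is what made this approach attractive when
computing power was scarce.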

PiTSSa is a continually evolving experiment which seeks to address
these (and other) issues. Our first working version was developed in
1984 on a Symbolics Lisp Machine, one of the few workstations at that
time which could support the type of interactive system we were
interested in designing. The rapid growth and development of computer
hardware in the past four years now makes it feasible to make our work
more widely available. For example, we are now in the process of
porting a portion of PiTSSa to an Apple Macintosh II, which is more
than an order of magnitude less expensive than a Lisp Machine was four
years ago.

PiTSSa has been used in several graduate level classes in time series
and spectral analysis over the past four years, both for in-class
demonstrations of data analysis techniques and for use by students in
class projects. It has also been demonstrated to dozens of colleagues
and groups of visitors. Discussions with individuals who have either
seen PiTSSa demonstrated or used it extensively have led to a number
of fundamental changes in the underlying design of PiTSSa. While we
are fairly happy with its form as reported in this article, we plan to
continue to use it as a test bed for new ideas in the future.

2.  WHY  LISP?

The question posed in the heading is the one most frequently asked by
colleagues who have seen demonstrations of PiTSSa. Our rationale for
embedding a statistical system within Lisp is discussed in the
subsections below. We remark here that there are languages other than
Lisp which have some (or all) of its desirable features and would be a
reasonable alternative choice (Smalltalk is a prime example). Lisp
does enjoy considerable popularity in the computer science community.
It is available and supported efficiently on a number of different
computers (ranging from special workstations designed specifically to
support Lisp, the so-called Lisp Machines, to personal computers).
This advantage is offset somewhat by a profusion of different dialects
of Lisp, a problem which stimulated the recent definition of Common
Lisp as a proposed standard (Steele 1984).

2.1  Interpretation  and  Compilation 

The simplest systems for interactive data analysis work in the
following way. The user types in a command; a command processor
(supplied by the system designer) interprets the command and does
something (returns one or more computed values, assigns a value to a
variable, or displays a plot); and the user looks at the results and
types in a new command. The designers of such systems supply the user
with a certain number of basic commands with which to work. Although
this number is quite large for sophisticated systems, no such system
is ever complete. It is impossible for designers to anticipate all the
needs of users (particularly since interactive data analysis is most
often exploratory in nature). The question then arises as to how to
let the user extend the statistical system to meet his or her needs.

A certain amount of flexibility is introduced by allowing the user to
write macros. Macros greatly decrease the amount of typing which a
user must do by packaging together commands into groups: a single
macro command expands into many basic commands. However, macros have
two problems. First, the user is really using the command language of
the statistical system as a programming language. This means that the
design and complexity of the tasks which can be accomplished by macros
is usually limited because the command language was designed to convey
statistical instructions to the computer and not to be a general
purpose programming language (with support for loops, conditional
execution, block structure, complicated data structures, and so
forth). This can be remedied, of course, if the system designer is
willing to take the time to augment the macro facility to include many
of these programming features.

A second problem with macros is that, after a user invokes a macro,
the commands of which it is composed are interpreted and executed by
the command processor one at a time. If the macro expands into a
hundred commands, the command processor must interpret and execute
each of these commands. The overhead of command processors is usually
not negligible. This means that execution of macros can be quite slow.
Again, there is a solution to this problem, but it means that the
system designer must provide for compilation of macros into efficient
machine code instead of just interpretation of their contents.

More sophisticated statistical systems (such as S (Becker and Chambers
(1985))) provide a second (and more powerful) way for a user to deal
with problems which cannot be handled by the basic commands. Here the
user is allowed to augment the set of basic commands by defining new
ones. The code which defines these new commands is written in a
full-fledged programming language (such as Fortran or C), and the code
for these commands is compiled (i.e., translated into efficient
machine code) to allow rapid execution. Once defined, these new
commands enjoy the same status as the basic commands supplied by the
system designers.

There are two problems with this approach. First, ideas for new
commands are often inspired by the results of interactive data
analysis within the statistical system. That means that the new
command has in effect already been implemented once in the form of
groups of commands and macros which have been pieced together. The
user must begin over again because the command language of the
statistical system and the programming language which is used to
define commands are different.

The second problem is the lack of extensive support for debugging
within a statistical system. The designers of such systems usually
assume that code for commands has been thoroughly debugged.
Unfortunately, subtle bugs often occur in supposedly well-tested
routines. If this occurs in the code for one of the commands in a
statistical system, the user usually has to revert to writing a driver
program in the native language which is used to define the commands
and do the debugging entirely outside of the statistical system. This
process can be quite time consuming.

How does use of Lisp improve this situation? Lisp is an interpreted
language. The user types a Lisp expression, and the command processor
for Lisp (known as the “reader”) interprets it and returns a value.
The value the reader returns can be as simple as a single number or as
complex as an entire function. By this means, the user can invoke
various defined functions and set (or, in Lisp terminology, bind) the
values of these functions to various symbols for later reference. It
is rather trivial to develop the basic structure of an interactive
statistical system in Lisp. The Lisp reader plays the role of the
command processor, and Lisp functions play the role of basic commands.
The equivalent of macros within Lisp is simply other Lisp functions,
since any Lisp function can make use of any other Lisp function. The
real task is thus to design a set of functions useful for interactive
data analysis. From the viewpoint of a potential user, learning enough
Lisp to make use of this system is no more difficult or time consuming
than learning to use a command-driven system such as S or MATLAB.

A number of benefits immediately follow. First, since Lisp is both an
interpreted language and a compiled language, Lisp statistical
“macros” can be executed as efficiently as any other function in Lisp.
Second, because Lisp is a full-fledged programming language, it has an
extensive range of features for loops, conditionals, and handling
quite complex data structures. The user does not have to try to
program in a command language designed primarily to facilitate
interactive data analysis. Third, because a statistical system written
in Lisp uses that language both as a command language and a
programming language, ideas for new data analysis functions which
arise in the course of an interactive data analysis as small test
functions can be readily repackaged as new Lisp functions. Fourth,
because Lisp is used so extensively in the computer science community,
extensive debuggers have been built for it. These allow the user to
quickly track down bugs in his or her program (or to detect errors in
existing Lisp functions), all without ever having to leave the Lisp
system.

2.2  Complex  Data  Structures 

When the user gives a command to the command processor of an ordinary
statistical system, the command processor returns a value. In simple
systems, this value may be a single number or an array of numbers; in
more complex systems, it may be a data structure, each slot of which
contains a number or an array. Command processors rarely deal with
more complex data structures than these. In contrast, the Lisp reader
can return values which are considerably more complex, allow greater
flexibility, and correspond more closely to the way in which a
statistician thinks about a problem. The following simple example
illustrates these ideas.

Suppose that we have a Lisp function which fits an autoregressive (AR)
model to a time series. Now a fitted AR model can be used to estimate
the spectral density function (sdf) of a series. What is the best way
to represent this estimated function? The usual approach is to express
it as a vector of values computed over a grid of equally spaced
frequencies.

In Lisp, however, we have the option of representing
it as an actual function: the result of
executing a Lisp function can be a new function.
This means that we can treat the estimated sdf
as a true function. It can be numerically integrated
and differentiated, and its peak values can be
searched for to within any desired precision.
This latter capability is particularly important,
since a common error in displaying an AR sdf is the
failure to evaluate it properly around sharp peaks
(Burg 1975). Representing the sdf as an actual
function makes it easier to design code to do this.
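
The idea of returning the estimated sdf as a function can be sketched as follows. This is an illustration in Python rather than Lisp, and it is not taken from the system described here: the names `make_ar_sdf` and `refine_peak`, the sign convention for the AR coefficients, and the scaling of the sdf by the sampling time are all assumptions made for the sketch.

```python
import cmath
import math
from typing import Callable, Sequence


def make_ar_sdf(phi: Sequence[float], innovation_variance: float,
                sampling_time: float = 1.0) -> Callable[[float], float]:
    """Return the sdf of an AR(p) model as a function of frequency.

    Assumed model: x[t] = phi[0]*x[t-1] + ... + phi[p-1]*x[t-p] + e[t].
    """
    def sdf(frequency: float) -> float:
        # 1 - sum_k phi_k * exp(-i 2 pi f k dt), evaluated at this frequency
        transfer = 1.0 - sum(
            coeff * cmath.exp(-2j * math.pi * frequency * k * sampling_time)
            for k, coeff in enumerate(phi, start=1)
        )
        return innovation_variance * sampling_time / abs(transfer) ** 2

    return sdf


def refine_peak(func: Callable[[float], float], lower: float, upper: float,
                tolerance: float = 1e-8) -> float:
    """Golden-section search for the maximum of func on [lower, upper].

    Assumes func has a single peak inside the bracketing interval.
    """
    ratio = (math.sqrt(5.0) - 1.0) / 2.0
    while upper - lower > tolerance:
        left = upper - ratio * (upper - lower)
        right = lower + ratio * (upper - lower)
        if func(left) < func(right):
            lower = left   # the peak lies in [left, upper]
        else:
            upper = right  # the peak lies in [lower, right]
    return 0.5 * (lower + upper)
```

Because `make_ar_sdf` returns a closure, a caller can evaluate the sdf at any frequency it likes, so a sharp peak first spotted on a coarse grid can be located with `refine_peak` instead of being limited by the grid spacing.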

2.3 Different Language Paradigms

The predominant language paradigm in use by
data analysts is called procedure-oriented programming.
This is the style of programming supported
by such languages as Fortran and C, where the basic
module is a subroutine or function. Lisp can
support this style of programming, but it has also
proven flexible enough to support many other
paradigms proposed in computer science over the
years. Among these are object-oriented programming
(discussed in greater detail in the next section),
constraint-oriented programming, and
access-oriented programming. Each of these paradigms is
useful in certain problem areas and, in particular, is
of potential use in supporting interactive data analysis
(see, in particular, McDonald (1986)).

3.  WHY  OOP? 

Object-oriented programming (OOP) has been
the subject of considerable attention in recent years
in the computer science community. Its use in the
statistical community has been fairly limited (for
examples, see Stuetzle (1987), McDonald (1986), and
Oldford and Peters (1986)), but we feel that it offers
a number of advantages over procedure-oriented
programming for constructing statistical systems.
The specific features of OOP which have facilitated
our development of PiTSSa are discussed in detail
in the subsections below, but first we make a few
subjective comments.

OOP is somewhat like structured programming
in that it gives systems designers a specific approach
for organizing, maintaining, modifying, and extending
large programs. For some problems, it seems
a more natural way to approach the programming
problem than structured programming, because it
allows program design to follow rather closely the
way a program appears from a user's point of view
(as operations on various objects, i.e., entities with a
separate identity within a computer). This yields a
number of benefits. First, it allows a user to develop
more easily a mental model of how a program will
react to certain actions which he or she takes. Second,
it allows a system designer to go in and make
changes to existing code more quickly: if the code
matches the way a program is perceived to work, it
is easier to know where to make changes.

While OOP has enjoyed considerable success for
programming problems where there is a more or less
natural decomposition of the problem into objects
(such as in the simulation of a paper mill, where the
objects correspond to physical entities such as paper
rollers), it is less obvious how it can be used to
construct a statistical system. We hope that the reader
is convinced of its usefulness by the end of this paper.
We can report, however, that we are now on our
third major redesign of PiTSSa in a four-year period
and that the claims of advocates of object-oriented
programming as to its benefits in terms of
modifiability and maintainability are true. In fact, each of
the redesigns has led to the definition of more objects
and a stronger use of the language paradigm.
However, our experience also shows that there are
subtleties in OOP that are not apparent to the novice
(at least to novices who were initially trained in the
more traditional procedure-oriented programming).

3.1  Classes 

A key concept in OOP is that of a class of objects.
All objects in a particular class share a particular
data structure. For example, one class in
PiTSSa is called ordered-x-y-pairs. Every object
in this class has three slots (sometimes called
instance variables). The symbolic names for these
slots are ordered-x-values, y-values, and
number-of-pairs. Typically ordered-x-values and y-values are
(pointers to) vectors of length number-of-pairs, and
ordered-x-values is assumed to have its values ordered.
A real-valued irregularly sampled time series
of length 100 could be (at least partially) represented
as a particular object of this class. We would need
to bind (assign) the slot ordered-x-values to a vector
of length 100 with the times at which the time
series was sampled; the slot y-values to a similar vector
with the values of the time series at each of the
100 times; and number-of-pairs to the value 100. A
second object of this class (used to represent, say,
a second time series) typically would have different
bindings (assignments) for one or more of its slots.
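
A class with these three slots might look like the sketch below. It is written in Python purely for illustration; the Python names mirror the slot names in the text, and the consistency checks in `__post_init__` are an added assumption, not a description of the actual Lisp implementation.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class OrderedXYPairs:
    """Hypothetical Python analogue of the class ordered-x-y-pairs."""
    ordered_x_values: Sequence[float]
    y_values: Sequence[float]
    number_of_pairs: int

    def __post_init__(self) -> None:
        # Both vectors are expected to have length number_of_pairs.
        if not (len(self.ordered_x_values) == len(self.y_values)
                == self.number_of_pairs):
            raise ValueError("slot lengths disagree with number_of_pairs")
        # The x values are assumed to be in nondecreasing order.
        xs = list(self.ordered_x_values)
        if any(a > b for a, b in zip(xs, xs[1:])):
            raise ValueError("ordered_x_values must be ordered")


# An irregularly sampled series of length 100 (made-up sampling times).
times = [0.1 * t * t for t in range(100)]
values = [float(t % 7) for t in range(100)]
series = OrderedXYPairs(times, values, 100)
```

A second object of the class is simply another call to `OrderedXYPairs` with different bindings for the three slots.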

3.2  Generic  Functions  (Message  Passing) 

A second fundamental notion in OOP is that of
a generic function. We illustrate the idea behind this
concept with an example from PiTSSa. Two of its
classes are called real-time-series and
complex-time-series, which are used to represent real-valued
and complex-valued time series, respectively (how
they are related to the class ordered-x-y-pairs is
discussed in the next subsection). A popular way of
fitting an autoregressive model to a time series is by
means of Burg's algorithm (Marple 1987). There are
two different univariate versions of this algorithm,
one for real-valued time series, and one for
complex-valued series. Suppose that we create a generic
function called “burg” which takes as input an object
of either the class real-time-series or
complex-time-series. If “burg” is a generic function, its
definition depends upon the class of its input. Thus, if
we apply “burg” to an object belonging to the class
real-time-series (complex-time-series), its
definition would be a routine which implements Burg's
algorithm for real-valued (complex-valued) series.

An equivalent way of expressing this idea is as
message passing. Here we conceptually send a message
to an object, and the object responds in a particular
way; the response depends on what class
it belongs to. Thus, if we send the message “burg”
to an object belonging to the class real-time-series
(complex-time-series), the object responds by
applying Burg's algorithm to the real-valued
(complex-valued) time series which it represents.

There are two distinct advantages to this approach.
First, we can reduce the number of commands
which we need to know. We need only remember
that “burg” is the proper message to pass
to (or generic function to use with) a time series in
order to invoke Burg's algorithm. There is no need to
define functions with slightly different names (such
as “rburg” and “cburg”) which essentially perform
the same operation but for different types of time
series.

Second, message passing allows us to construct
an abstraction barrier between usage and implementation.
For example, when we pass the message
“burg” to an object belonging to the class
real-time-series in the course of a data analysis, we really
don't care about the implementation details. The
system designer may well have implemented Burg's
algorithm for real-valued series by making use of the
complex version of the algorithm, but this shouldn't
matter to the user. Conversely, if the system designer
decides that the use of the complex algorithm
for real-valued series is too inefficient, he or she can
change this implementation detail without disrupting
users. This scheme guarantees the user a certain
response from a certain message, yet gives the
designer the option of changing the underlying details.
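
The dispatch described in this subsection can be sketched with Python's `functools.singledispatch`, used here as a stand-in for a Lisp generic function. The class bodies and the return values are placeholders invented for the sketch: the registered routines do not implement Burg's algorithm, they only report which routine would be selected.

```python
from functools import singledispatch


class RealTimeSeries:
    def __init__(self, values):
        self.values = [float(v) for v in values]


class ComplexTimeSeries:
    def __init__(self, values):
        self.values = [complex(v) for v in values]


@singledispatch
def burg(series, order):
    """Generic function: the definition used depends on the class of series."""
    raise TypeError(f"burg is not defined for {type(series).__name__}")


@burg.register
def _(series: ComplexTimeSeries, order):
    # Placeholder for the complex-valued version of Burg's algorithm.
    return {"routine": "complex Burg", "order": order,
            "length": len(series.values)}


@burg.register
def _(series: RealTimeSeries, order):
    # One possible implementation choice mentioned in the text: reuse the
    # complex routine for real-valued data. A designer could later replace
    # this body with a dedicated real-valued routine without changing any
    # caller of burg().
    result = burg(ComplexTimeSeries(series.values), order)
    return {**result, "routine": "real Burg (via complex routine)"}
```

A caller writes `burg(series, 4)` in both cases; which body runs is decided by the class of `series`, so there is no need for separately named functions such as “rburg” and “cburg”.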

3.3  Inheritance 

Inheritance is a mechanism in OOP which allows
us to construct new classes of objects based
upon modifications to existing classes. Again we
use an example from PiTSSa for illustration. The
objects in the class ordered-x-y-pairs can be used
to describe some of the properties of a time series
sampled at arbitrary points in time. Two messages
which can be sent to an object of the class
ordered-x-y-pairs are “mean” and “mean-time.” The first
returns the average value of the time series (i.e., the
average of the values in the vector bound to the slot
y-values), and the second, the average time at which
the observations were collected (i.e., the average of
the values in the vector bound to the slot
ordered-x-values).

Now the class real-time-series is intended to
represent real-valued time series sampled over an
equally spaced grid. The class is obviously quite
similar in some respects to ordered-x-y-pairs. We
may take advantage of this similarity by defining
real-time-series such that it inherits all of the slots
and ways of handling messages of the class
ordered-x-y-pairs. As an example, sending the message
“mean” to an object of the class real-time-series
would make use of the slots and message handling
inherited from ordered-x-y-pairs. We are free,
however, to define additional slots and ways of
handling both new and inherited messages for our new
class. These would express the difference between
the intended uses of the two classes. For example,
we could define the slots sampling-time and
first-time-value to replace the functionality of the slot
ordered-x-values in ordered-x-y-pairs. The values
of these two new slots can be used to generate all the
time values for a time series sampled over an equally
spaced grid of times, so there is no need to use the
vector ordered-x-values explicitly. Likewise, we can
redefine how the message “mean-time” is handled
by real-time-series so that it is computed using the
two new slots and the slot number-of-pairs inherited
from ordered-x-y-pairs.

The advantage of using inheritance is that it allows
us to construct rather complicated objects out
of simpler ones in a way that clearly expresses the
differences between related classes. This is quite a
useful way of modifying and extending a large software
system.
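
The inheritance relationship of this subsection can be sketched as follows. The Python classes are hypothetical analogues of ordered-x-y-pairs and real-time-series; the constructor signatures and default values are assumptions made for the illustration.

```python
class OrderedXYPairs:
    def __init__(self, ordered_x_values, y_values):
        self.ordered_x_values = list(ordered_x_values)
        self.y_values = list(y_values)
        self.number_of_pairs = len(self.y_values)

    def mean(self):
        """Average of the values bound to the slot y_values."""
        return sum(self.y_values) / self.number_of_pairs

    def mean_time(self):
        """Average of the times bound to the slot ordered_x_values."""
        return sum(self.ordered_x_values) / self.number_of_pairs


class RealTimeSeries(OrderedXYPairs):
    """Equally spaced series: times are implied by two extra slots."""

    def __init__(self, y_values, first_time_value=0.0, sampling_time=1.0):
        # The explicit vector of times is left empty; it is never needed.
        super().__init__(ordered_x_values=(), y_values=y_values)
        self.first_time_value = first_time_value
        self.sampling_time = sampling_time

    def mean_time(self):
        # Mean of t0, t0 + dt, ..., t0 + (N - 1) dt is t0 + dt (N - 1) / 2.
        return (self.first_time_value
                + self.sampling_time * (self.number_of_pairs - 1) / 2.0)
```

Here `mean` is inherited unchanged, while `mean_time` is redefined to use the two new slots together with the inherited `number_of_pairs`.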

4.  DESIGN  OF  PiTSSa 

There are three major classes in PiTSSa:
data objects, graph objects, and frame objects. The
first represents various types of time series and the
results of processing them; the second is used to
construct the graphical output of PiTSSa on a bitmap
display; and the third handles the user interface.
Our discussion below of objects in these classes is far
from exhaustive. We describe only a few of each
kind to give the reader a feel for the organization of
PiTSSa.

4.1  Data  Objects 

We have already described briefly three classes
of data objects in Sections 3.1 and 3.3:
ordered-x-y-pairs, real-time-series, and
complex-time-series. There are many others which are used to
represent various other kinds of time series, such
as real-time-series-with-missing-values (a
real-valued time series sampled regularly but with missing
observations) and vector-time-series (a
vector-valued time series sampled regularly). Each of these
classes has slots (or inherits slots from component
classes) for the actual values of the time series; the
symbolic units for the time series values; the sampling
time and Nyquist frequency (for regularly sampled
series); the network name of an ASCII file from
which the values of the time series were read; various
results from statistical computations (such as
the sample autocovariance function); and so forth.

When a message is passed to a particular data
object, the object responds by returning either a single
value, a compound data structure, or another
object. Examples of these for the class
real-time-series are the message “mean” (which returns the
average of the values in the time series); “acvf”
(which returns a vector with the values of the biased
estimator of the autocovariance function and, at the
same time, caches them in a slot in real-time-series
for possible future use); and “burg” (which returns
an object of the class arima-model-object). In
the last case, the returned object can itself respond
to messages. For example, an object in the class
arima-model-object can respond to the message
“sdf” in order to return to the user calculated values
(over a specified grid of frequencies) of the spectral
density function for the ARIMA model described by
the object.
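
The behavior of a message that both returns a result and caches it in a slot can be sketched as below. The class is a hypothetical Python stand-in; the slot name `_acvf_cache` is invented, and the estimator shown is the usual biased one, which divides by the series length N at every lag.

```python
class RealTimeSeries:
    def __init__(self, values):
        self.values = [float(v) for v in values]
        self._acvf_cache = None  # slot that holds the cached result

    def mean(self):
        return sum(self.values) / len(self.values)

    def acvf(self):
        """Biased estimator of the autocovariance function, lags 0..N-1.

        s[lag] = (1/N) * sum_t (x[t] - mean) * (x[t + lag] - mean)
        """
        if self._acvf_cache is None:
            n = len(self.values)
            centre = self.mean()
            centred = [v - centre for v in self.values]
            self._acvf_cache = [
                sum(centred[t] * centred[t + lag] for t in range(n - lag)) / n
                for lag in range(n)
            ]
        # Later calls return the cached vector without recomputing it.
        return self._acvf_cache
```

The first call to `acvf` does the computation; subsequent calls return the stored vector, which is the caching behavior the text attributes to the “acvf” message.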

The various classes of data objects are intended
to provide basic support for the purely computational
aspect of interactive time series and spectral
analysis. With just the classes and messages defined
here, a user can carry out a data analysis by typing
in Lisp expressions for evaluation by the Lisp reader;
the expressions will result in various messages being
passed to various objects; and the user can assign
the resulting values (or returned objects) to various
symbolic names for later use. This mimics completely
the interaction that typically occurs in an
interactive data analysis system, with the additional
advantages of generic functions, support for complicated
data structures, and a full-fledged programming
language supported by the Lisp programming
environment.

4.2  Graph  Objects 

Graph objects support two things: displays of
results of various computations on a bitmap terminal,
and interaction of the user with these graphs
by means of a “mouse” and a keyboard. Slots in
the class basic-graph-object provide for lists (or
sets) of various other objects, each of which can
respond to the message “draw-yourself.” These other
objects describe various portions of a graph: the
axes, the titles, mouse-sensitive regions (i.e., areas
on the graph over which a click of the button on the
mouse causes something to happen), plots of data,
and so forth.


The basic way in which a bitmap graph is constructed
in PiTSSa is by attaching drawable objects
(i.e., those which know how to respond to a
“draw-yourself” message) to an object of the class
basic-graph-object. Once this has been done, the user
sends a “draw-yourself” message to that object, and
it in turn sends “draw-yourself” messages to the objects
in its list of drawable objects. This makes it
relatively easy to extend PiTSSa to create specialized
graphs: the user need only define an appropriate
class of objects with slots to support the desired
features and an appropriate definition for the
“draw-yourself” message.

An important point to note is that drawable
objects are truly separate entities in PiTSSa. Thus
a drawable axis object can actually be on the list
of drawable objects for several different objects of
the type basic-graph-object. This allows us to
maintain consistency in the visual representation of
related graphs. For example, we might have several
plots of different spectral estimators for the same
time series. If the objects which represent these plots
each have the same drawable axis object for the vertical
axis, then changes to this object (say, in its
maximum axis value) can be made to propagate
automatically to all graphs of which this axis object is
a component.
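
The arrangement of a graph object that forwards “draw-yourself” to its drawable objects, with one axis object shared between graphs, can be sketched as follows. Everything here is illustrative: the class names echo the text, but the `Axis` and `LinePlot` classes, their slots, and the use of a list of strings in place of a bitmap are assumptions made for the sketch.

```python
class Axis:
    def __init__(self, label, minimum, maximum):
        self.label = label
        self.minimum = minimum
        self.maximum = maximum

    def draw_yourself(self, canvas):
        canvas.append(f"axis {self.label}: {self.minimum} to {self.maximum}")


class LinePlot:
    def __init__(self, name):
        self.name = name

    def draw_yourself(self, canvas):
        canvas.append(f"line plot of {self.name}")


class BasicGraphObject:
    def __init__(self):
        self.drawable_objects = []

    def attach(self, drawable):
        # The graph keeps a reference, not a copy, so one drawable object
        # may sit on the lists of several graph objects at once.
        self.drawable_objects.append(drawable)

    def draw_yourself(self):
        canvas = []  # stand-in for the bitmap display
        for drawable in self.drawable_objects:
            drawable.draw_yourself(canvas)
        return canvas
```

If the same `Axis` instance is attached to two `BasicGraphObject` instances, changing its `maximum` and redrawing both graphs shows the new value in each, because both lists refer to one object.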

4.3  Frame  Objects 

Frame objects are designed to support the user
interface in PiTSSa. An experienced user could
carry out an interactive data analysis by just creating
various data objects and graph objects and
sending messages to them. Since these are all implemented
in Lisp, he or she could define “on the fly”
new functions (or messages) to investigate a data set
thoroughly as new ideas for exploring data arise.

However, there is also a need in a statistical
analysis system to carry out fairly routine procedures
(particularly for the novice user). Frame objects
allow us to support these by defining a useful
user interface for particular procedures; by packaging
together sequences of calls to Lisp functions
appropriate for a particular type of analysis; by storing
important values returned from these calls in slots
in the frame object for later use; and by causing
the display of bitmap graphics to occur by attaching
appropriate drawable objects to objects of the
class basic-graph-object and sending these latter
objects the message “draw-yourself.” All of these
actions occur when a frame object receives the
message “do-frame.”

Each class of frame objects thus supports only
one type of statistical procedure. Some examples of
these classes are windowed-periodogram-frame
(described in the next section) and
autoregressive-sdf-frame. The code which handles the “do-frame”
message for each frame is in effect a small script of
actions for carrying out common procedures in time
series and spectral analysis. As such, these scripts are good
places to look for ideas on implementing new procedures.
If a user has an idea for a particular type of
procedure which is close to an existing frame procedure
in PiTSSa, he or she can often modify that existing
frame (possibly by using the inheritance mechanism
of OOP) to create a new frame class which is
tailored to the new procedure.

Figure 1. Screen dump of the bitmap of a monitor of a Symbolics Lisp Machine showing the results of
sending a “do-frame” message to an object of the class windowed-periodogram-frame. The screen
dump shows two distinct plots: the upper one shows a line plot of an estimate of the spectrum versus
frequency for some flow data from the Willamette River in Salem, Oregon, while the lower plot shows a
point plot of the corresponding time series.

5.  AN  EXAMPLE 

In this section we give an example to clarify
some of the ideas in Section 4. The example centers
around Figure 1, which is a screen dump of the monitor
of a Symbolics Lisp Machine. This display shows
the result of sending a “do-frame” message to an object
of the class windowed-periodogram-frame
(hereafter referred to as the frame object). This
message comes with a single argument, which must
be an object of either the class real-time-series
or the class complex-time-series (hereafter called
the time series object). In the present example, the
time series object represents the log of the average
monthly flow of the Willamette River at Salem, Oregon
from 1951 to 1984 (the dots on the bottom plot
of Figure 1 show a plot of this series versus time).

After the frame receives the “do-frame” message,
it presents a menu of options to the user. These
concern, among other things, prewhitening, data tapering,
type of spectral window, and associated window
parameter (see Priestley (1981) for a discussion
of the technical details of spectral analysis). After
the user specifies these options, the frame object
sends the necessary messages to the time series object
to calculate a windowed periodogram spectral
estimate. The time series object returns an object of the
class windowed-periodogram-spectral-object
(hereafter the spectral object) to the frame object.
The spectral object represents the spectral estimate
and associated information (bandwidth, variance,
degrees of freedom, etc.) and is bound to a
slot in the frame object for future reference.

The frame object next makes use of two objects
of the class basic-graph-object, which we
hereafter call the time graph object and the spectral
graph object. The time graph object is used
to create a bitmap plot of the time series associated
with the time series object, while the spectral
graph object does the same for a plot of the windowed
periodogram spectral estimate. To set this
up, the frame object attaches the time series object
(spectral object) to the list of drawable objects in
the time graph object (spectral graph object). The
frame object also attaches axis objects, title objects,
and mouse-sensitive region objects to the appropriate
lists in each graph object.

After objects are attached, the frame object
sends a “draw-yourself” message to both graph objects.
Each graph object in turn sends a
“draw-yourself” message to each of the objects in its list of
drawable objects. The results are a large plot of the
spectral estimate versus frequency (the solid line in
the upper part of Figure 1) and a smaller plot of the
time series versus time (the dots in the lower part).

When a graph object sends a “draw-yourself”
message to a mouse-sensitive region object, a small
icon is drawn in the margin of the plot. For example,
the right margins of both plots in Figure 1
each have a vertical stack of icons. The top four
icons in each stack are the same (a right arrow, a
graph icon, a pencil icon, and a scissors icon), as
are the bottom two (an asterisk icon and a button
icon). The upper spectral plot has three additional
icons (a kernel icon, a harmonics icon, and a
variance/bandwidth crossbar icon), while the lower time
plot has only one (a shaded window icon). Once the
plots have been displayed on the bitmap screen, the
user can move the mouse cursor (an arrow pointing
in the 11 o'clock direction) until it is over a particular
icon and click the mouse button to cause a
particular mouse state object to be activated on the
corresponding graph object. This mouse state object
is then used to interpret any mouse clicks the
user makes over any portion of the graph which is
not part of a mouse-sensitive region.

Three examples of the interactions possible via
these mouse-sensitive regions are shown in Figure 1.
If the user clicks on the kernel icon (the one below
the scissors icon on the spectral plot), a mouse state
object is activated which allows the user to draw
the kernel $K(\cdot)$ associated with the windowed
periodogram spectral estimate $\hat{h}(\cdot)$, where the kernel
appears by the relationship

$$E\{\hat{h}(\omega)\} = \int_{-N_f}^{N_f} K(\omega - \lambda)\, h(\lambda)\, d\lambda;$$

here $N_f$ is the Nyquist frequency and $h(\cdot)$ is the true
spectral density function ($K(\cdot)$ depends upon the
data taper and the spectral window). The user specifies
where $K(\cdot)$ is to be drawn by pointing and clicking
the mouse button; this defines where the top of
the central lobe of $K(\cdot)$ is to be plotted. (Internally,
the drawing is accomplished by attaching a kernel
object to the list of drawable objects in the spectral
graph object and sending it a “draw-yourself”
message.) In Figure 1 we placed $K(\cdot)$ at the upper
right-hand part of the spectral plot (shown as
a dashed line). Subsequent clicks allow the user to
relocate $K(\cdot)$ anywhere else in the central plotting
area of the spectral plot. This allows the user to
assess visually two important aspects of the spectral
estimate. First, if there are sharp features in $h(\cdot)$,
these will essentially cause the central lobe of $K(\cdot)$ to
be traced out. This does occur in Figure 1: there
is a sharp feature located at 1 cycle per year
(corresponding to the annual flow cycle in the Willamette
River), and the spectral estimate in that region has
the same shape as $K(\cdot)$ (as could be seen quickly
by relocating $K(\cdot)$ there). Second, the height of the
sidelobes relative to the main peak as compared to
the observed dynamic range in $\hat{h}(\cdot)$ is a good visual
indication of whether there may be significant bias
in $\hat{h}(\cdot)$ due to window leakage. This is evidently not
a problem in our example: the sidelobes of $K(\cdot)$
decay rapidly compared to the dynamic range
observed in $\hat{h}(\cdot)$.

For our second example, we describe the use of
the harmonics icon (the one below the kernel icon).
This icon is used to mark (using the mouse) the
location of a fundamental frequency and a certain
number of its harmonics. After the user clicks the
mouse over this icon, a mouse state object is activated
which first displays a small menu to query the
user about the number of harmonics to be drawn. In
the example in Figure 1, we requested 2 harmonics.
From then on, the mouse state object interprets a
mouse click as being the location at which the user
wishes to have a marker drawn indicating a fundamental
frequency. This is drawn in Figure 1 as a
solid vertical line at 1 cycle per year on the spectral
plot; the corresponding first two harmonics of
this frequency are indicated by vertical dashed lines
(again, this is accomplished internally by attaching
a drawable harmonics object to the spectral graph
object and sending it the message “draw-yourself”).
This allows the user to determine whether prominent
features on a spectral estimate are harmonically related
and can thus be attributed to a periodic phenomenon
in the time series. (If one or more of the
harmonics is higher than the Nyquist frequency, its
alias in the interval $(-N_f, N_f)$ is drawn as a vertical
dashed line with a smaller dash size.)
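
The aliasing rule in the parenthetical remark above can be made concrete with a short sketch. It is written in Python for illustration only; the function names are invented, harmonics are assumed to mean the integer multiples 2f, 3f, ... of the fundamental f, and a frequency falling exactly on the Nyquist frequency is folded to its negative.

```python
def alias_frequency(frequency, nyquist):
    """Fold a frequency into the interval [-nyquist, nyquist)."""
    period = 2.0 * nyquist
    return (frequency + nyquist) % period - nyquist


def harmonic_markers(fundamental, number_of_harmonics, nyquist):
    """List (multiple, plotted frequency, aliased?) for f, 2f, 3f, ...

    The fundamental is included as multiple 1, followed by the requested
    number of harmonics.
    """
    markers = []
    for multiple in range(1, number_of_harmonics + 2):
        frequency = multiple * fundamental
        aliased = abs(frequency) > nyquist
        plotted = alias_frequency(frequency, nyquist) if aliased else frequency
        markers.append((multiple, plotted, aliased))
    return markers
```

For monthly data with time measured in years, the sampling time is 1/12 year and the Nyquist frequency is 6 cycles per year, so a fundamental at 1 cycle per year with 2 harmonics gives markers at 1, 2, and 3 cycles per year, none of them aliased. A fundamental at 2.5 cycles per year would put its second harmonic at 7.5, which folds to -4.5 cycles per year.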

Our third example shows how extracting a subseries
from a time series can be done in PiTSSa
(this can be useful to investigate whether any visual
differences in the time plots of various subseries map
over into the frequency domain). This makes use of
the shaded window icon immediately below the scissors
icon on the lower plot. Once the user clicks
over this icon with the mouse, a mouse state object
is activated which allows the user to place two
vertical markers on the time series plot by pointing
and clicking with the mouse. These markers are
shown in Figure 1 as a vertical solid (dashed) line
near 1957 (1974). Once these two markers are in
place, the user can request that the windowed
periodogram frame calculations be repeated using only
the subseries defined by the markers instead of the
original series. This can be done by positioning the
mouse cursor over the button icon either on the time
series plot or the spectral plot (the bottom icon in
the right-hand column on either plot) and clicking
the mouse button. This causes a “do-frame” message
to be sent to the frame object with qualifiers
which cause it to use the subseries instead of the full
series as data.

6.  CONCLUSIONS 

We conclude by reconsidering the three questions
posed in Section 1. First, PiTSSa was specifically
designed to make systematic use of the graphical
interface possible with a modern computer workstation.
It accomplishes this through the use of
mouse state objects, drawable mouse-sensitive region
objects, and other drawable objects. These objects
share a common interface in PiTSSa through a
set of common messages to which they can respond.
This uniformity makes it clear how to define new
classes of objects to extend the system gracefully.

Second, the use of frame objects allows us to
define a high-level graphical interface for PiTSSa in
terms of more fundamental operations on data objects
and graph objects. The response that we have
gotten from students who have used the system
convinces us that this interface is useful for novices. Our
hope is that the overall design of PiTSSa is transparent
enough that more sophisticated users (other
than ourselves) can augment and modify it at will
(this level of usage remains to be tested). There
are really no constraints imposed on a sophisticated
user of the system other than those imposed by Lisp
itself. In fact, PiTSSa may be regarded as simply
a nefarious plot to get innocent users interested in
the programming potential of the Lisp environment
itself!

Third, we hope that the three simple examples in
Section 5 convince the user of the usefulness and power
of interactive graphics in time series and spectral
analysis (and other areas of data analysis). There are
several other interesting examples from PITSSA which we
plan to discuss in future articles. There is also much
more work to be done in this fruitful area before we
exhaust the potentials for improving interactive data
analysis pointed to by McDonald and Pedersen (1985a,
1985b, 1988).
INSIDE A STATISTICAL EXPERT SYSTEM:
Implementation  of  the  ESTES  system 

Paula  Hietala,  University  of  Tampere,  Finland 


ABSTRACT 

In  this  paper  we  describe  the  implementation  of  a 
statistical  expert  system  called  ESTES.  The  system  is 
intended  to  provide  guidance  for  an  inexperienced 
time  series  analyst  in  the  preliminary  analysis  of 
time series. The ESTES system has been
implemented on Apple Macintosh™ microcomputers
using a combination of the Prolog and Pascal
languages.

Keywords:  Statistical  expert  systems;  Rules; 
Explanation  capabilities 


1.  INTRODUCTION 

Statistical  expert  systems  are  an  interesting  and 
novel  area  of  statistical  computing  today  (see  e.g. 
Chambers  (1981),  Gale  (1986a)  and  Hietala  (1987)). 
However,  the  implementations  of  these  systems  are 
often  outlined  very  cursorily  and  the  reader  is  left 
unaware  or  in  doubt  of  the  methods  employed  as 
well  as  of  the  inner  structure  of  the  systems.  The 
purpose  of  this  paper,  on  the  contrary,  is  to  consider 
in  more  detail  one  implementation  (of  a  statistical 
expert  system  called  ESTES)  in  order  to  give  a  better 
insight  into  these  popular  systems. 

The  ESTES  (Expert  System  for  TimE  Series 
analysis)  system  is  intended  to  provide  guidance  for 
an  inexperienced  time  series  analyst  in  the 
preliminary  analysis  of  time  series,  i.e.  in  detecting 
and  handling  of  seasonality,  trend,  outliers,  level 
shifts  and  other  essential  properties  of  time  series. 
In  the  preliminary  analysis  of  time  series  it  is 
usually  the  case  that  an  expert  time  series  analyst 
detects  the  essential  features  of  a  time  series  just  by 
examining  its  graphical  representation  and 
autocorrelation  function,  without  any  complicated 
calculations. Even an inexperienced user may have
plenty of useful knowledge concerning the environment
of the problem in question. With this in mind, the
statistical knowledge in the system is organized so
that the system tries to exploit as much as possible of
the knowledge or experience that the user has about the
specific time series being considered. However, if
there is a conflict between the results computed by the
system and the knowledge elicited from the user, then
the ESTES system carries out more extensive analysis
and applies more sophisticated statistical methods.
With this kind of organization we strive to minimize
the number of unnecessary reasoning and calculation
steps.


The ESTES system has been implemented on
Apple Macintosh™ personal microcomputers using the
Prolog and Pascal languages. In this paper we
consider  the  overall  implementation  of  the  system. 
The  design  philosophy  and  user  interface  principles 
of  the  system  are  described  in  Hietala  (1986).  The 
organization  of  the  knowledge  base  and  the 
statistical  methods  employed  in  the  system  are  given 
a  detailed  treatment  in  Hietala  (1988). 


2.  STRUCTURE OF THE ESTES SYSTEM

The  structure  of  the  ESTES  system  and  the 
communication  between  its  principal  modules  is 
illustrated  in  Figure  1.  The  system  consists  of: 

-  a  main  module  which  takes  care  of 
communication  between  other  modules, 

-  a  statistical  knowledge  base  which  comprises 
the  knowledge  about  time  series  analysis, 

-  an  inference  engine  which  employs  the 
knowledge  in  the  statistical  knowledge  base, 

-  a  user  interface  module  which  interacts  with 
the  user, 

-  a  graphics  module  which  displays  graphical 
results, 

-  a  time  series  generation  module  which 
generates  example  time  series,  and 

-  numerical  computation  modules  which 
calculate  all  numerical  results  for  the  other 
modules  of  the  system. 

The  numerical  computation  modules  have  been 
implemented  in  Pascal,  all  the  other  modules  in 
Prolog. 

Next  we  briefly  describe  each  of  the  modules. 

2.1.  User  interface  module  and  graphics  module 

The  user  interface  of  the  ESTES  system  is 
especially  designed  for  an  inexperienced  user  (see 
Hietala  (1986)).  The  system  is  highly  interactive:  the 
user interacts with the system using pull-down
menus,  dialog  windows,  overlapping  and 
transferable  data  windows,  with  a  mouse  as  a 
pointing  and  selection  device.  Figure  2  illustrates 
the  Macintosh-like  user  interface  of  the  system. 
Whenever  possible  both  numerical  and  graphical 
displays  of  the  data  and  statistics  (for  example, 
autocorrelations  and  partial  autocorrelations 
calculated  from  the  data)  are  offered  to  the  user. 
Also  the  shape  parameter  of  graphical  displays  (the 
ratio  of  height  and  width  of  a  figure)  may  be  chosen 
by the user. For example, in Figure 2 we have a
graphical display of data (see the window "Time
Series: x"). If the user wants to change the shape
parameter of this display, he/she activates the
graphical window and chooses a new shape parameter by
using the "Window details" command from the "Windows"
menu. Also, the user may activate with a mouse any
individual point in the graphical display: the value
of the time series and the time point are shown in the
display. Moreover, the numerical display of the data
(see the window "Data: x") is scrollable and editable.
Changes and insertions in the data window are
immediately seen also in the graphics window.

Figure 1. The structure of the ESTES system.

Figure 2. An example screen of the ESTES user interface.

The  visual  lexicon  of  the  system  (see  the 
Lexicon  menu)  gives  support  to  the  user  during 
his/her  work.  This  is  similar  to  the  lexicon  concept 
of  Gale  (1986b);  however,  in  our  system  the  lexicon 
illustrates  the  definition  of  unknown  statistical 
terms  graphically  and  explains  the  meaning  of  the 
terms  used  by  the  system.  For  example,  if  the  user 
asks  for  an  explanation  for  the  term  "trend",  the 
system  produces  both  a  graphical  representation  of  a 
time  series  with  a  trend  and  also  a  textual 
explanation  of  the  term  "trend".  The  visual  lexicon 
contains  a  rather  small  amount  of  precomputed 
information  for  the  graphical  representation;  the 
reason is that we strive for a dynamic lexicon. By
this we mean the ability of the system to produce
several different examples of the phenomenon in
question. We have implemented this feature by
including generation rules in the lexicon instead of
example data sets.

Let us consider our example situation in Figure 2
a little more closely. Let us assume that the user
wants to remove the trend from time series x by
applying non-seasonal differencing. Therefore the
system has inquired about "the degree of differencing"
(see the Differencing window: our system's suggestion
for the degree of differencing is 1). Next, the user
would like to know why the system asks for this fact
(he/she chooses the soft button "why" in the
Differencing window). After that, the user can read
the explanation from the Why explanation window.
However, the user wants still more information about
the term "degree of differencing", so he/she selects
the term in question by marking it in the Why
explanation window and then requests the system to
explain this term through its visual lexicon (the user
chooses the corresponding action from the Lexicon
menu).

The ESTES user interface module (e.g. graphics
windows and menus) has been implemented using the
LPA MacPROLOG™ compiler (see Clark et al. (1988))
and its advanced graphics tools. Interestingly, Prolog
seems  to  be  well-suited  for  this  task;  only  a  few 
numerical  calculations  in  the  user  interface  module 
have  been  implemented  in  Pascal  for  the  sake  of 
convenience. 

2.2.  Knowledge  base  and  inference  engine 

A knowledge base for storing the expert's
knowledge of a problem domain and an inference engine
for inferring solutions and explaining the system's
actions are at the very heart of any expert system:
this is also the case with statistical expert systems.
Next we briefly describe the principles employed in
the ESTES system in implementing these two components.
A more detailed account of these matters can be found
in Hietala (1988).

There are several ways of representing
knowledge in the knowledge base. We have selected
if-then rules for representing knowledge concerning
properties of time series and their handling. Rules
in our system are either of the form: RuleName: if
condition_A then conclusion_B, or of the form:
RuleName: if condition_A then action_C. Rules of
this kind are easily expressed in Prolog; they are
legal Prolog clauses provided we define appropriate
Prolog operators (e.g. ':', 'if', 'then'). The
condition and action parts of a rule usually also
include invisible calls to Pascal procedures (see
Section 2.4 for a more detailed discussion of the
interplay between Prolog and Pascal).
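
The two rule forms can be illustrated outside Prolog as well. The following is only a rough Python sketch, not the ESTES source: the `Rule` class, the `fire` function and the rule and fact names (`rule_detect_trend`, `acf_decays_slowly`, and so on) are hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

Facts = Dict[str, bool]


@dataclass
class Rule:
    name: str
    condition: Callable[[Facts], bool]                 # the "if" part
    conclusion: Optional[str] = None                   # "then conclusion_B"
    action: Optional[Callable[[Facts], None]] = None   # "then action_C"


def fire(rule: Rule, facts: Facts) -> bool:
    """Apply one rule; return True if its condition held."""
    if not rule.condition(facts):
        return False
    if rule.conclusion is not None:
        facts[rule.conclusion] = True   # assert the concluded fact
    if rule.action is not None:
        rule.action(facts)              # carry out the action
    return True


# Hypothetical rule of the first form: if condition_A then conclusion_B.
rule_trend = Rule(
    name="rule_detect_trend",
    condition=lambda f: f.get("acf_decays_slowly", False),
    conclusion="has_trend",
)

# Hypothetical rule of the second form: if condition_A then action_C.
log: List[str] = []
rule_remove = Rule(
    name="rule_remove_trend",
    condition=lambda f: f.get("has_trend", False),
    action=lambda f: log.append("remove_trend"),
)

facts: Facts = {"acf_decays_slowly": True}
fired = [r.name for r in (rule_trend, rule_remove) if fire(r, facts)]
```

In Prolog the same effect is obtained more directly, since a rule written with user-defined operators is itself a clause that the interpreter can inspect.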

The  knowledge  base  of  the  ESTES  system  has 
been  organized  so  that  the  selection  of  a  class  of 
statistical  methods  will  be  determined  using  a 
hierarchy  of  criteria,  i.e.  according  to 

(1)  the  property  being  considered, 

(2)  the  granularity  of  analysis  process  (whether 
we  are  performing  initial  or  more  extensive 
analysis), 

(3)  the  goal  of  the  analysis  process  (detecting  or 
handling  the  property  in  question),  and 

(4) the knowledge possessed by the user about
the specific property as well as his/her
general knowledge about time series (the
background of the user may vary from a
student to an expert).

Within  the  chosen  class  of  statistical  methods,  the 
final  selection  will  be  made  according  to  the  power 
of  the  methods,  i.e.  the  most  powerful  method 
available  is  selected  first. 

Although the Prolog language is itself an
inference engine, it is not sufficient for our
purposes. We do not use Prolog's own trace facility
but have built an interpreter on top of Prolog. This
interpreter manages the reasoning process of the
ESTES system; it interacts with the user during the
reasoning process and also after it. For example,
after the system has asked the user for some
information concerning the time series, the user can
ask a 'why' question ("Why does the system inquire
about this fact?"). Also, after the system has
completed its reasoning process the user may ask 'how'
questions ("How has the system reached this
conclusion?"), see e.g. Bratko (1986). Our system's
reply to why and how questions consists of displaying
a user-friendly form of its inner inference chain with
explanations and justifications of those methods that
are used inside the chain. In addition to the textual
explanation, the system's answer can contain
displays of graphical results.
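
To make the idea of answering a 'how' question from the inference chain concrete, here is a minimal Python sketch of a backward-chaining interpreter that records each step it takes. It is an assumption-laden illustration, not the ESTES interpreter: the rule table, the goal names and the wording of the explanation lines are all hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical knowledge base: conclusion -> (rule name, premises).
RULES: Dict[str, Tuple[str, List[str]]] = {
    "remove_trend": ("rule_remove_trend", ["try_differencing_ok"]),
    "try_differencing_ok": ("rule_try_differencing", ["has_trend"]),
}


def prove(goal: str, known: Dict[str, bool], chain: List[str]) -> bool:
    """Backward-chain on goal, appending every successful step to chain."""
    if known.get(goal, False):
        chain.append(f"{goal}: given by the user or the data")
        return True
    if goal not in RULES:
        return False
    rule_name, premises = RULES[goal]
    if all(prove(p, known, chain) for p in premises):
        chain.append(
            f"{goal}: concluded by {rule_name} from {', '.join(premises)}"
        )
        return True
    return False


def how(goal: str, known: Dict[str, bool]) -> List[str]:
    """Return the recorded inference chain, or an empty list on failure."""
    chain: List[str] = []
    return chain if prove(goal, known, chain) else []


explanation = how("remove_trend", {"has_trend": True})
```

A 'why' answer could be produced in the same spirit by reporting the rule whose premise is currently being asked about; that part is omitted here.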
2.3.  Time series generation module

Graphical examples in time series analysis (as
well as in other branches of statistical analysis) can
be very illuminating and instructive for an
inexperienced analyst. Besides serving this purpose,
the time series generation module of our system is
also utilized by the lexicon mechanism when producing
examples of graphical representations of statistical
issues inquired about by the user. A third use of the
time series generation module is in the development
phase of a statistical expert system: the developer can
generate various test series (that otherwise would be
difficult to obtain) and examine the system's
behaviour with respect to these test series.

So, the user first generates his/her own
example data (for example, a specific time series with
a property he/she is interested in) using the
generation feature and then he/she can examine
this data (time series) with the help of the system.
Thus the generation feature eases the learning
of preliminary time series analysis: the user can very
easily get acquainted with typical time series and
their properties.

The ESTES system applies ARIMA models in
the generation process but the user does not
necessarily need to have knowledge about ARIMA
models. He/she only describes the properties which
he/she wants to be embedded in the time series and the
system chooses the appropriate model. But if the
user wants, he/she can also define a precise model
structure for the time series generation process.

The  actual  computation  in  the  time  series 
generation  module  is  implemented  in  Pascal, 
employing  the  numerical  computation  modules  of 
our  system. 
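
The paper does not list the generation procedure itself. As a minimal sketch, assuming that a request for a series "with a trend" is mapped to an ARIMA(1,1,0) model with drift and a request without a trend to a stationary AR(1) model, the generation step might look as follows in Python. The function name, the default parameter values and the mapping from property to model are illustrative assumptions, not the ESTES implementation (which is written in Pascal).

```python
import random
from typing import List


def generate_series(n: int, trend: bool = False, phi: float = 0.5,
                    drift: float = 0.3, seed: int = 0) -> List[float]:
    """Simulate an AR(1) series; if a trend is requested, integrate it
    once, which gives an ARIMA(1,1,0) series with drift."""
    rng = random.Random(seed)
    w: List[float] = []      # stationary AR(1) part
    prev = 0.0
    for _ in range(n):
        prev = phi * prev + rng.gauss(0.0, 1.0)
        w.append(prev)
    if not trend:
        return w
    x: List[float] = []
    level = 0.0
    for value in w:
        level += drift + value   # cumulative sum: one order of integration
        x.append(level)
    return x


stationary = generate_series(50, trend=False, seed=1)
trending = generate_series(50, trend=True, seed=1)
```

Differencing `trending` once and subtracting the drift recovers `stationary`, which is the relation between the two models that the expert system exploits when it removes a trend.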

2.4.  Numerical  computation  modules 

The ESTES system, like other systems
performing statistical calculations, demands quite
heavy numerical computation power. Prolog is not
designed  for  numerical  but  for  symbolic 
computation,  so  for  efficiency  reasons  we  have  not 
used  Prolog  for  numerical  computation. 
Computational  components  of  statistical  expert 
systems  usually  employ  some  existing  statistical 
software  package  or  are  programmed  in  an  ordinary 
procedural  language,  such  as  Pascal  or  C. 
Unfortunately  we  did  not  find  any  sufficiently 
flexible  existing  statistical  software  package  for  the 
implementation,  so  the  use  of  a  procedural  language 
(in  our  case,  Pascal)  for  all  numerical  computation 
was  necessary. 

Figure 3 illustrates the interplay of the Prolog and
Pascal languages. On the left we have a fragment of
the knowledge base, coded in Prolog, and on the right
one of the numerical computation modules, coded
in Pascal. Let us assume that in the knowledge base
the rule 'rule_remove_trend' is selected because the
condition 'try_differencing is ok' is true. So the
action 'remove_trend' is executed next. The body of
this action contains a special Prolog predicate
'call_pascal' which has as one of its parameters the
list of time series values. This list is passed to the
corresponding Pascal function, which converts the
list to an array and then carries out the actual
differencing. After that the results are returned as a
list to Prolog. To the user of the system, however,
the interplay of these two languages is hidden.

    PROLOG: lists (left panel)
    Knowledge Base

        rule_remove_trend:
            if   try_transformation is ok
                 or
                 try_seasonal_differencing is ok
                 or
                 try_differencing is ok
            then remove_trend.

        remove_trend :-
            /* differencing is ok */
            call_pascal(n, TSlist, DTSlist, ...).

    TSlist is passed from Prolog to Pascal;
    DTSlist is returned from Pascal to Prolog.

    PASCAL: arrays (right panel)
    A Numerical Computation Module

        function pascal_routine_n
            (argc: integer): boolean;
        var x, dx: array [1..200] of real;
            t2, t3: cellpo;
        begin
            t2 := get_arg(2);
            t3 := get_arg(3);
            list_to_array(t2, x);
            differencing(x, dx, ...);
            array_to_list(dx, t3, ...);
            pascal_routine_n := SUCCESS;
        end.

    TSlist  = list of time series values
    DTSlist = list of differenced time series values
    x       = array of time series values
    dx      = array of differenced time series values

Figure 3. The interplay between Prolog and Pascal languages.
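
The round trip described above (list to array, numerical work on the array, array back to list) can be mimicked in a few lines of Python. This is only a sketch of the pattern: the function names `differencing` and `call_numeric` and the `degree` parameter are hypothetical stand-ins, not the Pascal routines of ESTES.

```python
from array import array
from typing import List


def differencing(x: array, degree: int = 1) -> array:
    """Apply non-seasonal differencing `degree` times:
    dx[t] = x[t] - x[t-1]."""
    dx = x
    for _ in range(degree):
        dx = array("d", (dx[t] - dx[t - 1] for t in range(1, len(dx))))
    return dx


def call_numeric(ts_list: List[float], degree: int = 1) -> List[float]:
    """Mimic the round trip of Figure 3: list -> array -> compute -> list."""
    x = array("d", ts_list)          # corresponds to list_to_array
    dx = differencing(x, degree)     # numerical work on the array
    return dx.tolist()               # corresponds to array_to_list


dts_list = call_numeric([1.0, 3.0, 6.0, 10.0])
dts_list2 = call_numeric([1.0, 3.0, 6.0, 10.0], degree=2)
```

The caller sees only lists going in and coming out, which is the sense in which the interplay of the two representations stays hidden.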

3.  CONCLUDING  REMARKS 

The  ESTES  system  is  an  experimental  research 
vehicle  for  studying  the  use  of  artificial  intelligence 
(AI)  techniques  in  producing  statistical  expert 
systems. Our system has not yet been tested in
real-life situations, because its current knowledge base is
too  small.  In  the  near  future  our  main  emphasis  in 
the  development  of  the  system  will  be  in  deepening 
its  domain  knowledge  concerning  preliminary  time 
series  analysis. 

However,  we  think  that  our  system  rather 
nicely  embodies  the  two  faces  of  statistical  expert 
systems,  i.e.  the  deductive  component  (usually 
programmed  using  an  expert  system  shell  or  an  AI 
programming  language,  such  as  Lisp  or  Prolog)  and 
the  computational  knowledge  component  (which 
usually  employs  some  existing  statistical  software 
package  or  procedures  programmed  in  an  ordinary 
procedural  language,  such  as  Pascal  or  C).  In  our 
opinion,  this  "dynamic  knowledge  base"  (the 
computational  knowledge  component  outlined 
above)  is  very  typical  for  statistical  expert  systems. 

In  our  case,  the  use  of  a  combination  of  the 
languages  Prolog  (in  the  deductive  component)  and 
Pascal  (in  the  computational  knowledge  component) 
turned  out  to  be  a  very  suitable  way  of 
implementing  a  statistical  expert  system. 
ABSTRACT

This paper deals with the problem of reasoning about
conceptualizations (sets of relevant parameters) of physical
processes. The problem is discussed in the context of the
COPER discovery system. COPER conjectures parameters
characterizing physical processes and the functional
relationships among them. The COPER system utilizes the
idea of changing representation base to determine the
arguments of invariant functional descriptions. It must handle
two kinds of uncertainty: about relevance of parameters,
and measurement error. A statistics/probability approach
has been used to estimate the effect of measurement error
in the COPER system. The partially adequate results of
this approach are presented. Alternative approaches to the
measurement error problem will be suggested.

INTRODUCTION 

The process of discovery of a physical law involves
reasoning based upon experimental data obtained from
observations and measurements. The measurements are
never perfect; they always include both the essential
information about the behavior of the physical system and some
noise. A discovery system must be able to perform
reasoning on such noisy data and extract the causal relationships.
The indication for existence of such a relationship is some
regularity in the data. This regularity must be described on
at least two levels: some conceptualization (a set of
concepts) must be defined, and then a relationship should be
described in terms of these concepts. This is a very difficult
task. One of the reasons for this is that a discovery system
must deal with two kinds of uncertainty:

-  uncertainty related to the lack of knowledge on whether
   the parameters the system is measuring are all the
   relevant parameters,

-  uncertainty caused by noise in the input data.

This paper reports how these two problems of uncertainty
are handled by the discovery system called COPER [Kokar
1986a, 1986b].

Two approaches to discovery of regularities in
measurement data can be distinguished; let us call them
piece-incremental and global. In the piece-incremental approach
a conjecture is made after getting any single new piece of
data. In the global approach a conjecture is made after all
data has been collected. Between the boundaries delineated
by these two approaches there is room for all kinds of
mixed strategies: repeated cycles of collecting some amount
of data and drawing an inference. The mixed strategies can
be viewed as combinations of the two. After we collect
some amount of data we perform some global reasoning; then
the conclusion can be treated as one piece of information
which can be input to the piece-incremental reasoning
system. In this paper we concentrate on the global reasoning
approach.


We introduce a measure, which is the fundamental
tool for making inferences in the global approach.
Measures are functions which assign numeric values to sets
of observations. The measure we introduce in this paper
assigns a numeric value to a conceptualization of a physical
process. It is based on the predictive power of a
conceptualization: the better the prediction, the lower the value of
the measure. The use of such a measure in the process
of deriving a physical law from observational data is
straightforward: the system generates conceptualizations of a
physical process, applies the measure to them, and selects the
conceptualization for which the value of this measure takes
its minimum.

Such a statement of the problem might suggest that
the system generates a model of the process and then tests
its predictive power. This could be called a "traditional
approach". The drawback of such an approach is that we
are not able to assess which part of the model is responsible
for the wrong predictions. In the approach presented in this
paper we construct a measure which:

-  is able to assign blame/credit to particular parameters
   in the model,

-  does not require postulating a functional dependency
   describing the model (because the definition of this
   measure utilizes the principle of similarity, we call it
   the "similarity measure").

THE  SIMILARITY  MEASURE 

In  the  process  of  constructing  the  similarity  measure 
we  make  use  of  some  syntactic  properties  of  physical  laws. 
We  consider  here  the  laws  which  are  represented  by  some 
functional  formulas  (or  algorithms).  The  arguments  of 
these  functions  are  so  called  "dimensional  quantities",  i.e., 
a  numeral  followed  by  several  "units"  with  some 
exponents.  For  instance,  a  physical  quantity  of  "velocity"  is 
expressed  as: 

The functions describing physical laws must fulfill some
constraints. For instance, the function Y = 3 m + 5 kg does
not have an interpretation in the language of physics, thus it
should be disallowed. The constraints guarantee that by
performing some syntactic operations on a representation
of a physical process we do not generate objects
which are not interpretable in the domain. The constraints
are captured by the requirement of dimensional invariance
of functions representing physical laws with respect to the
change of the representation. Formally the invariance is
represented as:

$$F(TX_1,\ldots,TX_n) = T\,F(X_1,\ldots,X_n),$$

where $F$ is a functional formula, $X_1,\ldots,X_n$ are physical
parameters (dimensional quantities), and $T$ is a
transformation of the representation. The discussion of the syntactic
properties of physical laws is beyond the scope of this
paper; an interested reader is asked to refer to the literature
on the subject (e.g., [Whitney, 1968], [Birkhoff, 1960], [Drobot,
1953], [Kokar, 1981, 1985]). This problem falls into the
field of dimensional analysis.

In the theory of dimensional analysis a theorem exists
(sometimes called the Pi-theorem), which says that any
function

$$Z = F(A_1,\ldots,A_m,B_1,\ldots,B_r),$$

which is dimensionally invariant, can be represented as a
(new) function

$$Z = f(Q_1,\ldots,Q_r)\,A_1^{a_1}\cdots A_m^{a_m},$$

where $Q_1,\ldots,Q_r$ are the new parameters constructed out of
the initial parameters according to the following formula:

$$Q_j = B_j \,/\, \left(A_1^{a_{j1}}\cdots A_m^{a_{jm}}\right).$$

Dimensional analysis gives the rules for both the
partitioning of the set of arguments into the A- and B-arguments,
and for calculating the values of all the exponents in the
above formulas.

Suppose a physical process is fully characterized by
the physical parameters $A_1,\ldots,A_m,B_1,\ldots,B_r$, i.e., that the
value of some characteristic of the process $Z$ can be
uniquely determined by these parameters (functionality).

By an "instance" we will mean one measurement of
the physical process. Formally an instance $R$ can be
represented as an $(m+r+1)$-tuple
of the values of the parameters. When carrying out
experiments with a physical process we obtain a collection of
instances usually represented as a table. For any instance
we can calculate the values of the Q-parameters, $Q_j(R)$.

Two instances $R'$ and $R''$ are called "similar" when the
following relationship holds:

$$Q_1(R') = Q_1(R''),\ \ldots,\ Q_r(R') = Q_r(R'').$$

The similarity relation is an equivalence relation on a set of
measurements $M$. Given a set of instances (measurements)
$M$ and a conceptualization $C = \{A_1,\ldots,A_m,B_1,\ldots,B_r,Z\}$, the
similarity relation partitions the set $M$ into equivalence
classes; we call these classes "similarity classes".

The formula for $Z$ can be transformed into

$$f(Q_1,\ldots,Q_r) = Z \,/\, \left(A_1^{a_1}\cdots A_m^{a_m}\right) = Q_Z.$$

Given a conceptualization, $C$, and a set of instances, $M$, the
similarity measure $SM(C,M)$ can be calculated in the
following steps:

(1)  Using dimensional analysis, determine the forms of the
     monomials $Q_1,\ldots,Q_r$.

(2)  Partition the set of instances $M$ into similarity classes.

(3)  Using the above formula for $f$, calculate the value of
     this function for each instance in $M$. Note that to
     determine these values we do not need to make any
     assumptions about the form of the function $f$; we
     calculate them using the right side of the above formula.

(4)  For each similarity class calculate the mean value of $f$.

(5)  $SM(C,M)$ is the mean value of all absolute differences
     between the calculated values of the function $f$ and the
     mean value of $f$ in the similarity class to which each
     instance belongs.
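
The five steps can be sketched in Python as follows. This is an illustrative reading of the procedure, not COPER's code. It assumes that step (1) has already been done, so the exponents of the monomials are supplied by the caller, and it groups instances by rounded Q-values to cope with floating-point arithmetic (the rounding is our assumption). The example data are hypothetical, noise-free values for a body moving as s = v0*t + g*t**2/2, with A-arguments g and t, B-argument v0, and Z = s.

```python
from collections import defaultdict
from typing import Dict, List, Sequence

Instance = Dict[str, float]


def monomial(inst: Instance, numerator: str, a_names: Sequence[str],
             exponents: Sequence[float]) -> float:
    """Return numerator / (A_1^e_1 * ... * A_m^e_m) for one instance."""
    denom = 1.0
    for name, e in zip(a_names, exponents):
        denom *= inst[name] ** e
    return inst[numerator] / denom


def similarity_measure(instances: List[Instance], a_names: Sequence[str],
                       b_exponents: Dict[str, Sequence[float]],
                       z_name: str, z_exponents: Sequence[float],
                       digits: int = 9) -> float:
    # Steps 1-2: exponents are given; group instances with equal Q-values.
    classes = defaultdict(list)
    for inst in instances:
        key = tuple(round(monomial(inst, b, a_names, e), digits)
                    for b, e in sorted(b_exponents.items()))
        # Step 3: f = Z / (A_1^a_1 ... A_m^a_m); no form of f is assumed.
        classes[key].append(monomial(inst, z_name, a_names, z_exponents))
    # Steps 4-5: mean absolute deviation of f from its class mean.
    total = 0.0
    for values in classes.values():
        mean = sum(values) / len(values)
        total += sum(abs(v - mean) for v in values)
    return total / len(instances)


# Hypothetical data: Q = v0 / (g * t), f = s / (g * t**2) = Q + 1/2.
data = [
    {"g": 10.0, "t": 1.0, "v0": 10.0, "s": 15.0},
    {"g": 10.0, "t": 2.0, "v0": 20.0, "s": 60.0},
    {"g": 10.0, "t": 1.0, "v0": 20.0, "s": 25.0},
    {"g": 10.0, "t": 2.0, "v0": 40.0, "s": 100.0},
]
sm_full = similarity_measure(data, ["g", "t"], {"v0": [1, 1]}, "s", [1, 2])
sm_missing = similarity_measure(data, ["g", "t"], {}, "s", [1, 2])
```

With the complete conceptualization the measure is zero; dropping v0 merges all four instances into one class and the measure becomes 0.5.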

PROPERTIES  OF  THE  SIMILARITY  MEASURE  AND  ITS  USE 
IN  REASONING 

The similarity measure defined in the previous section
has some very useful properties. The most important
property of this measure can be summarized in the following
statement.

If a physical process is fully characterized by the
conceptualization $C = \{A_1,\ldots,A_m,B_1,\ldots,B_r,Z\}$, i.e., $Z$
functionally depends on the remaining parameters,
then for any set of instances $M$ of this process the
value of the function $f$ (or $Q_Z$) is constant for any
similarity class, and consequently, the value of the
similarity measure $SM(C,M)$ is equal to zero.

To prove this property of the similarity measure let us first
notice that $f(Q_1,\ldots,Q_r)$ must be constant on a similarity
class. This is a consequence of both functionality of the
relation $f$, and of the definition of a similarity class (a class
is determined as a set of instances for which all the Q's are
constant). In such a case the mean value of this function is
equal to the (constant) value of the function, and thus the
difference must be equal to zero. This is true for each
single similarity class. The similarity measure is defined as the
mean value of the absolute difference from the mean value of $f$
for all the classes, therefore the similarity measure must be
equal to zero.

The contrapositive of this property says that if the
value of the similarity measure is not equal to zero then the
dependency of $Z$ on the conceptualization $C$ is not
functional. This means that some of the parameters in the
conceptualization are missing. If a parameter $B_k$ is missing
from our considerations, then as a consequence, the
respective $Q_k$ parameter is missing too. It means that in our
definition of similarity classes one of the constraints,
$Q_k(R') = Q_k(R'') = \cdots = \text{constant}$, is not taken into
account.

The use of this measure is straightforward. The
system can search for a conceptualization by adding one
parameter at a time to it and calculating the value of the
similarity measure. If the value of the similarity measure
improves (significantly) then the theoretical parameter is
included into the conceptualization, otherwise it is ignored
and the search continues. The search stops when the value
of the similarity measure is close (enough) to zero. One of
the very important features of this algorithm is that a
physical parameter does not need to be varied in the set of
measurements M in order to be judged.
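
The search just described can be sketched as a simple greedy loop. The Python below is an illustration only: the thresholds `min_gain` ("significantly") and `tol` ("close enough to zero") are arbitrary placeholders, since how to choose them is exactly the open question discussed next, and `toy_measure` is a hypothetical stand-in for the similarity measure evaluated on real data.

```python
from typing import Callable, Dict, FrozenSet, Iterable, List


def search_conceptualization(candidates: Iterable[str],
                             measure: Callable[[FrozenSet[str]], float],
                             min_gain: float = 0.05,
                             tol: float = 1e-6) -> List[str]:
    """Add one candidate parameter at a time, keeping it only if the
    measure improves by more than min_gain; stop when it is near zero."""
    chosen: List[str] = []
    best = measure(frozenset(chosen))
    for p in candidates:
        if best <= tol:                 # close enough to zero: stop
            break
        value = measure(frozenset(chosen + [p]))
        if best - value > min_gain:     # significant improvement: keep p
            chosen.append(p)
            best = value
        # otherwise p is ignored and the search continues
    return chosen


# Hypothetical stand-in for SM(C, M): each relevant parameter that is
# still missing from the conceptualization contributes a fixed amount.
RELEVANT: Dict[str, float] = {"v0": 0.5, "g": 0.3}


def toy_measure(params: FrozenSet[str]) -> float:
    return sum(w for name, w in RELEVANT.items() if name not in params)


result = search_conceptualization(
    ["colour", "v0", "mass", "g", "length"], toy_measure
)
```

In this toy run the irrelevant candidates are skipped and the loop stops before examining "length", because the measure has already reached zero.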

The similarity measure gives us some means to handle
uncertainty about the set of parameters characterizing a
physical process under investigation. Even assuming no
noise in the measurement data we face the problem of how
much is enough. As was pointed out in the above
paragraph, at least two questions need to be answered: what
does it mean for the similarity measure to improve
"significantly", and what is close "enough" to zero?
A further complication is introduced by the presence
of noise in the measurement data. Deciding if the
similarity measure has improved "significantly" or if it is
close "enough" to zero can be confounded by the presence
of measurement error. A lower value of the similarity
measure could mean that the theoretical parameter just
added should be included. Alternatively, it could indicate
improved precision in the measurement data.

To answer these questions requires use of one or more
of the models of approximate reasoning, e.g.,
probability/statistics, rough sets, fuzzy sets,
Dempster-Shafer, and so forth. Our initial analysis, presented below,
attempts to use the probability/statistics approach to
evaluate the effect of noise in the measurement data.

STATISTICAL  INTRACTABILITY  OF  THE  COPER  DECISION 
PROBLEM 

For the COPER system described above, one knows
that a complete set of parameters has been found when the
similarity measure is zero. In a world of exact measurements,
such a result would be unambiguous. However,
measurement errors exist for physically measured quantities.
Thus, the parameters (variables) which COPER considers
as candidates for the physical law under investigation
have uncertain values. It follows that the completeness
decision is not necessarily straightforward. How constant
does Qz have to be? When is the observed non-zero value
of the similarity measure due to measurement error of the
parameters and when is it due to missing parameters?

Our application of probability/statistics to this problem
attempts to answer the following question. Given
information on the distribution of errors for the physical
entities, what is the resulting error distribution for Qz? We
will assume that the measurement errors in the parameters
are normally distributed with a mean of
zero and some known standard deviation. This is a common
assumption for measurement error. Since Qz is a
function of the measured parameters, we are concerned
with transmittal of variation through this functional
relationship. Mathematical statistics would describe the
situation as a function of random variables problem. There
are analytic solutions to many such problems. As many
readers might know, if one sums independent normal random
variables, the result is also normal (Mendenhall,
1986). If one multiplies independent positive random variables
under conditions where no one random variable is dominant,
the result is approximately log normal, that is, its natural
log is approximately normally distributed (Lewis, 1987).
However, the COPER problem is not
limited to simple sums or products of parameters. Many of
the transformations of interest require division. To describe
the error distribution for Qz we must determine the variance
of Qz. However, it can be shown (see Appendix A)
that the variance of (1/X), for X a normally distributed random
variable, is not finite and so cannot be found analytically.
Thus division, one of the more common transformations used
by COPER, immediately removes the problem from the realm of
analytic solutions. We present below an approximate solution
to the problem.


APPROXIMATE  STATISTICAL  SOLUTION  TECHNIQUE 

We have designed a simulation approach to determine
the error distribution of Qz under certain assumptions.
For any particular analysis problem, there are a limited
number of possible groupings of the A- and B-parameters
into the Q-parameters. Each grouping imposes a particular
functional form for Qz. We define a representative problem
which has several functional forms for Qz. We then randomly
assign errors to the parameters, and analyze the statistical
variation in Qz. If this statistical variation in Qz can
be shown to fit a known distribution, we argue that functional
transformations of similar form will result in the
same, known distribution of error variation.

Given this premise, we assume that measurement
errors are normally distributed with a mean of zero and a
known standard deviation which is a percent of the measured
value. We simulate a large number of observations
(with random measurement errors) for each functional form
of Qz. We hypothesize that the variation in Qz is normally
distributed with a mean of zero. The size of the standard
deviation of Qz will depend on both the Qz-form and on the
magnitude of the measurement errors.

We present results of the simulation for three functional
forms of Qz. The size of the measurement error will
be varied to determine the effect, if any, on the distribution
of Qz.

DESCRIPTION  OF  SIMULATION  MODEL 

The simulation model assumes a complete representation
of Qz, that is, a representation for which the similarity
measure would equal zero if no measurement error exists.
Any variation we observe in Qz will, thus, be due to
transmission of measurement errors of the parameters.

Newton's law, for example, has as one complete
representation Qz = s/vt. We will describe the simulation
approach for this example. We similarly treat the additional
forms of Qz to be investigated. The user selects specific
values for a, v and t; s is calculated from Newton's Law
(s = 0.5at^2 + vt). Likewise the true value for Qz is defined by
s, v and t (Qz = s/vt). A set of measurements is simulated
for these values of s, v and t by generating and adding normally
distributed independent random errors. These
instances belong to one similarity class. The user can
assign the standard deviation of the error as a fraction of
the measured value.
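
The simulation procedure for the Newton's law example can be sketched roughly as follows. This is an illustration only, not the authors' program: the function and parameter names (`true_values`, `simulate_qz_errors`, `rel_sd`, `seed`) are hypothetical, the values chosen for a, v and t are arbitrary, and the same relative error is assumed for s, v and t. The similarity measure is taken to be the mean absolute difference between each instance of Qz and the mean of Qz.

```python
import random
import statistics


def true_values(a, v, t):
    """Return the exact s and Qz for chosen a, v, t (s = 0.5*a*t^2 + v*t)."""
    s = 0.5 * a * t ** 2 + v * t
    return s, s / (v * t)


def simulate_qz_errors(a, v, t, rel_sd, n=100, seed=0):
    """Simulate n noisy measurements of s, v, t; return exact Qz and Qz errors.

    rel_sd is the error standard deviation as a fraction of each
    measured value (a hypothetical parameter name).
    """
    rng = random.Random(seed)
    s, qz_exact = true_values(a, v, t)
    errors = []
    for _ in range(n):
        s_m = s + rng.gauss(0.0, rel_sd * s)
        v_m = v + rng.gauss(0.0, rel_sd * v)
        t_m = t + rng.gauss(0.0, rel_sd * t)
        errors.append(s_m / (v_m * t_m) - qz_exact)
    return qz_exact, errors


def similarity_measure(values):
    """Mean absolute difference between each instance and the mean."""
    mean = statistics.fmean(values)
    return sum(abs(x - mean) for x in values) / len(values)


if __name__ == "__main__":
    qz, errs = simulate_qz_errors(a=2.0, v=10.0, t=3.0, rel_sd=0.02)
    qz_values = [qz + e for e in errs]
    print("exact Qz:", qz)
    print("mean error:", statistics.fmean(errs))
    print("sd of error relative to Qz:", statistics.stdev(errs) / qz)
    print("similarity measure:", similarity_measure(qz_values))
```

Fixing the seed makes a run repeatable; varying `rel_sd` corresponds to varying the size of the measurement error.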

The exact value of Qz is known. A "real" value of Qz
is calculated after the measurement errors are introduced
into the data. Then the difference between the exact and
the actual (with errors) value of Qz is determined. The process
is repeated a large number of times (n=100 in our
examples below). The mean and variance of Qz (that is, the
function f above) are calculated for these 100 instances.
Note, the similarity measure is the sum of absolute differences
between the mean of Qz and each instance divided by
the number of instances. The statistical distribution of Qz is
tested for fit to a Normal distribution using the chi-squared
test.

ANALYSIS OF SIMULATION RESULTS


EXPERIMENTAL  DESIGN 

We are analyzing the effect of three factors. Both the
size of measurement error and the parameter with error
may affect the resulting error in the similarity measure. The
third factor is the functional form of the relationship
between Qz and the measured variables. In our case, the
functional form is fixed for a particular situation. It is, one
might say, defined by the dimensional analysis being pursued.
Changing the size of measurement error within one
functional form will provide a complete analysis for this
particular functional form.

Thus our experimental design reduces to a two factor
design with replications for the different functional forms
of interest. A typical two-factor experiment measures the
effect of two factors, say pressure and temperature, on
yield. In such a case, the recommended procedure is to
change the factors together and not one at a time. The
"together" approach follows the response surface and
avoids false conclusions which might otherwise result. We
follow this approach in our experimental plan.


We found strong support for the assumption that the
errors in Qz are normally distributed if the measurement
errors are small. For measurement errors with standard
deviation of less than 5 percent, the distribution of errors
easily passes the chi-squared test. If the errors are larger
than 10 percent and if the variable appears only in the
numerator of Qz, the normality assumption continues to
pass the chi-squared goodness-of-fit test. However, for
errors of more than 10 percent for a term appearing in the
denominator of Qz, the normal assumption is rejected far
more frequently than would occur by chance. The results
are even more non-normal if the variable is raised to a
power in the denominator.
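
A chi-squared goodness-of-fit check of this kind might look like the sketch below. It follows the description of the test used for Table 1 (six cells, mean and standard deviation estimated from the sample, alpha = 0.01, hence 6 - 1 - 2 = 3 degrees of freedom), but the choice of equal-probability cells and all function names are assumptions made for illustration; the paper does not say how the cells were formed.

```python
import statistics

# Upper 1% point of the chi-squared distribution with 3 degrees of freedom.
CHI2_CRITICAL_3DF_ALPHA_01 = 11.345


def chi_squared_normality(sample, cells=6):
    """Chi-squared goodness-of-fit statistic against a fitted normal.

    Cells are equal-probability intervals of the normal distribution
    fitted with the sample mean and sample standard deviation
    (the equal-probability choice is an assumption).
    """
    n = len(sample)
    fitted = statistics.NormalDist(statistics.fmean(sample),
                                   statistics.stdev(sample))
    # Interior cell boundaries at the 1/cells, 2/cells, ... quantiles.
    edges = [fitted.inv_cdf(k / cells) for k in range(1, cells)]
    observed = [0] * cells
    for x in sample:
        index = sum(1 for edge in edges if x > edge)
        observed[index] += 1
    expected = n / cells
    return sum((o - expected) ** 2 / expected for o in observed)


def rejects_normality(sample, critical=CHI2_CRITICAL_3DF_ALPHA_01):
    """True if the normal hypothesis is rejected at alpha = 0.01."""
    return chi_squared_normality(sample) > critical
```

Applied to the 100 simulated values of Qz from one run, `rejects_normality` returns the accept or reject decision for that run; repeating it over 25 runs gives the kind of rejection count a table row would summarize.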

The magnitude of the error in Qz is quite consistent if
errors are small. For errors of less than 5 percent, the error
in Qz is between 1.7 and 2.5 times the original measurement
error. The larger error transmission occurs when a
variable is raised to a power in the denominator.


Each row in Table 1 summarizes twenty-five simulation
experiments for each of three functional forms. The
first five rows monitor the effect of increasing measurement
error in all variables simultaneously. The next twelve
measure the effect of one variable having much larger error
than the others. The chi-squared test for normality divides the
100 observations from each simulation into six cells.
Expected frequencies are calculated using the sample mean
and sample standard deviation as estimates for the population
parameters. The calculated chi-squared test statistic is compared
to a rejection value for alpha = 0.01 and three
degrees of freedom (critical value = 11.345).

Table 1. Results of Simulation Runs
Sample Size of 100 for Each Simulation
25 Simulations for Each Functional Form of Qz

APPENDIX A

Consider the variance of (1/Z) where Z is a normal random
variable with mean = 0 and standard deviation = 1. Note,
any normal random variable X, with mean = $\mu$ and standard
deviation = $\sigma$, can be transformed to a standard normal Z
using the relationship:

$$Z = (X - \mu)/\sigma.$$


CONCLUSIONS 

Analysis of the COPER similarity measure using the
probabilistic/statistical approach is helpful but in a limited
way. One must assume that measurement errors (noise in
the input data) are fairly small (as a percent of measured
value) and that the errors are themselves normally distributed.
In such a case, the uncertainty in Qz resulting from
measurement error is normally distributed. In addition,
given these same assumptions, the standard deviation of
this transmitted measurement error is between 1.7 and 2.5
times the original measurement error when defined as a
percent of the "true" value. The 1.7 multiplier applies when
only linear functions of the parameters appear in the
denominator of Qz. The 2.5 multiplier applies when the
squared value of a parameter appears in the denominator of
Qz.
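
One way to see where multipliers of roughly this size can come from is first-order error propagation. The sketch below is not part of the simulation study. It assumes independent errors with the same relative standard deviation $r$ in every measured parameter, uses $Q_z = s/(vt)$ from the Newton's law example, and takes $Q' = s/(at^2)$ as a hypothetical form with a squared parameter in the denominator.

```latex
\ln Q_z = \ln s - \ln v - \ln t
\quad\Longrightarrow\quad
\left(\frac{\sigma_{Q_z}}{Q_z}\right)^2 \approx r^2 + r^2 + r^2 = 3r^2,
\qquad
\frac{\sigma_{Q_z}}{Q_z} \approx \sqrt{3}\,r \approx 1.73\,r

\ln Q' = \ln s - \ln a - 2\ln t
\quad\Longrightarrow\quad
\left(\frac{\sigma_{Q'}}{Q'}\right)^2 \approx r^2 + r^2 + (2r)^2 = 6r^2,
\qquad
\frac{\sigma_{Q'}}{Q'} \approx \sqrt{6}\,r \approx 2.45\,r
```

Both values are close to the simulated multipliers of 1.7 and 2.5. Under these assumptions the multiplier depends only on the exponents with which the parameters enter the functional form, and the approximation holds only while the relative errors are small.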

We plan to incorporate these results into COPER.
Given information about the size of the measurement
errors, the system can decide if the error in Qz is well
defined, that is, if the situation fits the small error conditions
described in the simulation analysis above. The system
can further decide, based on the functional form of Qz,
the approximate size and distribution of errors in Qz. Thus,
under the limited conditions defined above, COPER will
incorporate analysis of uncertainty caused by noise in the
input data.

However, a full resolution has not been attained. We
plan to evaluate alternative approaches such as fuzzy set
theory and Dempster-Shafer belief functions for dealing
with uncertainty in our attempts to expand the decision
rules for dealing with uncertainty within the COPER system.

The variance of 1/Z is

$$\mathrm{Var}(1/Z) = E\{[1/Z - E(1/Z)]^2\} = E\{1/Z^2 - 2(1/Z)E(1/Z) + [E(1/Z)]^2\} = E(1/Z^2) - [E(1/Z)]^2.$$

Consider $E(1/Z^2)$:

$$E(1/Z^2) = \int_{-\infty}^{+\infty} \frac{1}{z^2} f(z)\,dz = \int_{-\infty}^{+\infty} \frac{1}{z^2} \frac{1}{\sqrt{2\pi}} e^{-z^2/2}\,dz.$$

Consider the interval from -1 to +1. For this interval
$f(z) = \frac{1}{\sqrt{2\pi}} e^{-z^2/2}$ has an upper bound of $1/\sqrt{2\pi}$ and a lower
bound of $\frac{1}{\sqrt{2\pi}} e^{-1/2}$. Let $K = \frac{1}{\sqrt{2\pi}} e^{-1/2}$.

Then

$$\int_{-1}^{+1} \frac{1}{z^2} \frac{1}{\sqrt{2\pi}} e^{-z^2/2}\,dz \ge \int_{-1}^{+1} K \frac{1}{z^2}\,dz = K \int_{-1}^{+1} \frac{1}{z^2}\,dz.$$

But this integral has a value of infinity. Since the integrand
is nonnegative, $E(1/Z^2)$ is at least as large as this integral
and is therefore infinite. Hence the variance of (1/Z) is not
finite and cannot be found analytically.

ABSTRACT

Planning and estimating work effort for a project is one of the most difficult
activities in Project Management. It is also a critical activity since
preliminary estimates translate directly to estimated costs. Projects with low
estimated costs end up with insufficient funding, while projects with high
estimates are simply not considered for development. To assist with estimation a
strategy integrating knowledge-based techniques with procedural techniques is
proposed. An Integrated Planning Model (IPM) based on this concept is described
in this paper.


1.0  Introduction  to  Project  Management 

Managing a project using the critical path
method typically involves three main stages:
Planning, Scheduling and Control.

During the Planning stage the project is
broken down into smaller, more manageable
components called activities. Work effort is
estimated for each of these activities using
past experience as a guide. Formulas or
models, and actual historical data are also
used.

Once the work effort estimation is completed
a network is created to show the sequence of
activities that make up the entire project.
Several project management tools are available
for the above purpose. They range from powerful
mainframe based products such as IBM's
Application System to relatively smaller
project management tools based on
microcomputers.

The next stage is Scheduling. Here we map
the activities to a calendar, and determine
start and finish dates for each task.

The last stage is Control, and this ensures
that the entire project is completed on time
and within budget. Good control also ensures
that the end products are of good quality.


2.0 Automated Approach

While several software tools are available
for project management, few provide
assistance with estimating. Existing tools
assist project developers only after
activities have been defined and the work
effort estimated. Subsequently, useful
networks are drawn, the Critical Path traced,
reports generated, and graphs such as GANTT
drawn.

At the outset it would appear that it is
meaningless for any tool to support work
effort estimation; after all, every project
is different from another! But on further
analysis, it is evident that this logic is
incorrect. As explained earlier, projects are
always broken down finely into tasks and
activities. Any new project will therefore
have some activity that was performed earlier
(e.g., all projects involve creating a users
manual). It is then possible to borrow such
estimates for the new project. To quote
Meilir Page-Jones, "the best way to assign a
cost to a given task is to identify the cost
for an identical task performed earlier in
the shop." [2]

Unfortunately few project managers have the
opportunity to do so, as no useful data about
previous attempts ever get recorded. Several
explanations are offered for this lack of
data by Page-Jones, including "people have no
time to collect project data", and "Nobody in
this shop has the statistical skills to
apply the collected data meaningfully."

2.1 Metrics Group

To get around this problem some Data
Processing shops have established a Metrics
Group. [3] The members of this group are
specialists in measurement and estimating and
they acquire their skills over many projects.
They also function independently of the
project manager and therefore are not subject
to political pressure and bias.

There are several advantages with this
approach:

a) Members acquire specialised estimation
skills.

b) They will have acquired adequate
statistical skills.

c) Project Managers and other developers can
rely on this group to obtain better results.


But there are also several disadvantages:

a) Maintaining such an exclusive group can be
expensive.

b) Benefits will be seen only after the group
is well established.

c) High staff turnover in this group can be
devastating.

d) Dividing authority between the Metrics
group and the Project Manager can be tricky.

2.2 Knowledge-Based Systems

An alternative strategy would be to develop
knowledge-based systems to assist with
planning and estimation. Such systems
provide the following advantages, as stated
in Waterman:

a) Permanent resource available.

b) Expertise is easy to transfer.

c) Consistent procedures used.

d) Affordable.

Accordingly, a new Integrated Planning Model
(IPM) is proposed.

The model uses knowledge-based, procedural
and statistical techniques, and can integrate
with existing software systems (e.g.,
project management tools, database systems,
spreadsheets and other estimation models).

3.0 IPM Design Overview

The architecture of IPM is illustrated in
Figure 2. The heart of the system is the "IPM
Kernel," which determines the recommended
work effort.

The Knowledge-base component is used to store
facts and rules. (Knowledge-based systems
are made up of a knowledge-base and an
inference engine.) Up-to-date information for
a given domain is stored in the knowledge-base.
The inference engine is responsible for
processing the information in the knowledge-base
and coming up with solutions. (See
Figure 1 for more details.) Two expert
systems reside here: the Monitoring System
and the Consultation System.

FIGURE 1. STRUCTURE OF A KNOWLEDGE-BASED SYSTEM

The Historical Database contains actual data
from previous projects (e.g., type of
project, time taken to complete, activities,
resources utilized). The Current Database
stores information about existing projects
(projects just initiated and projects not yet
completed). On completion of a project the
actual hours are transferred to the
Historical Database.

The Learning module updates the knowledge-base
with current information. It also acts
as an edit window to remove outdated
information, or delete incorrect files (that
is, entire projects can be removed).

Finally, the Scheduler is responsible for
receiving the recommended estimates, and
generating reports and charts, such as the
Project Schedule, Project Calendar, Resource
Table, and Critical Path.

3.1 IPM Execution

Execution of IPM involves the following
phases:

1. Identifying the project as belonging to a
particular category.

This decision is made manually by the project
manager on the basis of the Planning Document,
Requirements Analysis and estimated number of
thousands of Delivered Source Instructions
(DSI). For instance a popular model such as
Boehm's COCOMO can be used. Boehm considers
the development of a software system for a
company that has determined that their
program will have roughly 32,000 DSI. The
following equations of the COCOMO model are
used to estimate important characteristics of
such a software system.

Effort: MM = 2.4 (32)^1.05 = 91 man-months

(One man-month = 152 hours of working time)

Schedule: Estimated Development Time = 2.5 (91)^0.38
= 14 months

Average Staffing: 91 man-months / 14 months =
6.5 Personnel
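
The arithmetic above can be reproduced with a few lines of code. The sketch below simply evaluates the two equations with the coefficients quoted in the text; the function name `basic_cocomo` and its parameter names are illustrative and do not come from IPM.

```python
def basic_cocomo(kdsi, a=2.4, b=1.05, c=2.5, d=0.38):
    """Evaluate the COCOMO equations quoted in the text.

    kdsi: thousands of Delivered Source Instructions.
    Returns (effort in man-months, schedule in months, average staff).
    """
    effort = a * kdsi ** b
    schedule = c * effort ** d
    return effort, schedule, effort / schedule


HOURS_PER_MAN_MONTH = 152  # as stated in the text

if __name__ == "__main__":
    effort, schedule, staff = basic_cocomo(32)
    print(f"Effort:   {effort:.1f} man-months "
          f"({effort * HOURS_PER_MAN_MONTH:.0f} hours)")
    print(f"Schedule: {schedule:.1f} months")
    print(f"Staffing: {staff:.1f} persons")
```

Because the sketch carries unrounded values through, its staffing figure comes out slightly above the 6.5 obtained in the text from the rounded 91 man-months and 14 months.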

On the basis of such estimates, classify the
project as belonging to one of the categories
a) Small, b) Intermediate or c) Large.

2. Generation of a Work Break Down
Structure.

The next step involves generation of the Work
Break Down Structure (WBS). This structure is
a hierarchy that identifies all the end
products. All project and development work
are defined here as activities. IPM asks some
more questions about the nature, scope and
size of the project and automatically
provides a default WBS. There is a unique
WBS for each of the following categories of
projects: Small, Intermediate and Large. (A
very reliable WBS can be generated if a
related project can be identified in the
Historical Database.) If the WBS supplied by
IPM is not completely satisfactory it has to
be modified. This customization is important
as every project is slightly different from
another.

A sample WBS generated for a small project is
shown below:

Subphase 1.3 User Requirements

1.3.1 Set Up The Project

1.3.2 Review Existing System

1.3.3 Interview Users

1.3.4 Document User Requirements

1.3.5 Review Document with Users

3. Loading the WBS with Raw Work Effort.

The above WBS is now ready for loading with
raw work effort. IPM supplies a Suggested Raw
Work Effort (SRWE) for each of the above
activities. This is the simple mean of the
previous estimates for comparable projects.
If data for any particular activity does not
exist in the database the SRWE field is
simply left blank. See activity 1.3.5 for an
example.

Subphase 1.3 User Requirements          SRWE (Hours)

1.3.1 Set Up The Project                12
1.3.2 Review Existing System            15*
1.3.3 Interview Users                   3
1.3.4 Document User Requirements        13
1.3.5 Review Document with Users

As indicated above SRWE numbers are simple
statistical averages.

If the sample size for SRWE is less than 5,
or if the variance is large, the SRWE value
will be flagged with an asterisk (such as
in 1.3.2). Alternatively, three columns can
be displayed, the Minimum SRWE, Average SRWE
and Maximum SRWE, as illustrated in Table I.

Table I: SRWE (Hours)

Subphase 1.3 User Requirements       a.   b.   c.   d.   e.   Min  Max  Ave

1.3.1 Set Up The Project             13   12   15   10   12   10   15   12
1.3.2 Review Existing System         15    2   19   25   12    2   25   15
1.3.3 Interview Users                 3    4    4    3    2    2    4    3
1.3.4 Document User Requirements     14   12   15   11   13   11   15   13
1.3.5 Review Document with Users      0    0    0    0    0    0    0    0
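
The SRWE calculation can be illustrated with the figures from Table I. The sketch below is a guess at how such a routine might look, not IPM's actual code: the function name, the use of the coefficient of variation as the test for a "large" variance, and the 0.5 threshold are all assumptions, and the zeros shown for activity 1.3.5 are treated as "no recorded data".

```python
import statistics

# Hours recorded for five comparable past projects (columns a-e of Table I).
# Activity 1.3.5 has no recorded data, so its list is empty.
HISTORY = {
    "1.3.1 Set Up The Project": [13, 12, 15, 10, 12],
    "1.3.2 Review Existing System": [15, 2, 19, 25, 12],
    "1.3.3 Interview Users": [3, 4, 4, 3, 2],
    "1.3.4 Document User Requirements": [14, 12, 15, 11, 13],
    "1.3.5 Review Document with Users": [],
}


def suggest_raw_work_effort(hours, min_samples=5, max_cv=0.5):
    """Return min/avg/max SRWE and an asterisk flag, or None if no data.

    A value is flagged when there are fewer than min_samples observations
    or when the coefficient of variation exceeds max_cv (both thresholds
    are hypothetical).
    """
    if not hours:
        return None  # SRWE field is left blank
    mean = statistics.fmean(hours)
    spread = statistics.stdev(hours) if len(hours) > 1 else 0.0
    flagged = len(hours) < min_samples or (mean > 0 and spread / mean > max_cv)
    return {"min": min(hours), "avg": round(mean),
            "max": max(hours), "flagged": flagged}


if __name__ == "__main__":
    for activity, hours in HISTORY.items():
        srwe = suggest_raw_work_effort(hours)
        if srwe is None:
            print(f"{activity:36s}")
        else:
            mark = "*" if srwe["flagged"] else ""
            print(f"{activity:36s}{srwe['avg']}{mark}")
```

With these assumed thresholds only activity 1.3.2, whose recorded hours range from 2 to 25, is flagged.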


4. Loading Raw Work Effort

If the SRWE values appear to be
unsatisfactory they may be changed completely.
This may come about after a consultation with
the Knowledge-base system. Also at this
stage activities with null SRWE's are given
an estimated value (manually by the Project
Manager). For example, 1.3.5 is given a value
of 2 hours.

Subphase 1.3 User Requirements          SRWE      RWE
                                        (Hours)   (Hours)

1.3.1 Set Up The Project                12
1.3.2 Review Existing System            15*
1.3.3 Interview Users                   3
1.3.4 Document User Requirements        13
1.3.5 Review Document with Users                  2

Assumptions are noted down in the database as
to why a particular RWE value was given. This
will come in handy if the estimates have to
be revised again in the future.


5. Invoking the Monitoring System:

The Monitoring System (which is an expert
system) evaluates the estimated hours for the
entire project. It can also make selective
analysis of the different phases and
subphases. A comment such as this might occur
after the initial evaluation:

"The total time allocated for System Testing
is too low. The following are historic
maximum, minimum and average values ...

The following rule of thumb for scheduling a
software task is recommended [Brooks]: 1/3
Planning, 1/6 Coding, 1/4 Component Test and
Early System Test and 1/4 System Testing. You
may want to revise the estimated figures for
Phase 4."

Another general purpose comment such as this
might also occur:

"Application Prototyping is strongly
recommended for small business applications.
This has to be done in Phase 2."

6. Loading Task Performers (Invoking the
Consultation System):

The task performers have to be assigned at
this stage. The RWE values are divided and
adjusted according to their individual
skills. The Consultation System (expert
system) is invoked at this stage. Unlike the
Monitoring System, which can be classified as
a general purpose system, the Consultation
System is a specialised knowledge-base based
on the local environment (i.e., a particular
Data Processing shop). Consequently, each
organization has to have its own consultation
system. This system functions best in the
interactive mode. The following queries can
be asked:

"List the names of all CICS programmers."

"If John Smith were to do the coding, how
long would it take him?"

"What is the D.P. shop policy on
Documentation?"

Information on individual software packages
can also be queried. For instance the
following information may be helpful:

"FASTCODE is a good package for Report
Writing, it can generate a simple report in
10 minutes; it has a good training manual and
takes approximately 6 hours to complete the
tutorial."

It is obvious that the Consultation System
will act as a good training tool as well.
Valuable information can be extracted from
the knowledge-base.

It is evident from the above examples that
the generation of estimates is dependent on the
local environment. Local technical
experience with knowledge engineering and
designing expert systems must be available,
otherwise this component will not be useful.
This phase can be completed manually,
as is currently done.

7. Scheduler

At this stage the estimates are passed to the
Scheduler. Several project management tools
assist with scheduling. Based on the
activity network, completion dates for all the
activities are allotted.
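
The kind of check that lies behind the Monitoring System comment on System Testing in phase 5 can be sketched as a single rule. The code below is only an illustration of such a rule, not the expert system itself: the function name, the stage labels used as dictionary keys, and the tolerance of one twentieth are assumptions.

```python
from fractions import Fraction

# Rule of thumb quoted in the Monitoring System comment [Brooks].
BROOKS_FRACTIONS = {
    "Planning": Fraction(1, 3),
    "Coding": Fraction(1, 6),
    "Component Test and Early System Test": Fraction(1, 4),
    "System Testing": Fraction(1, 4),
}


def review_allocation(hours_by_stage, tolerance=Fraction(1, 20)):
    """Return a comment for each stage whose share of the total hours
    falls more than `tolerance` below the rule-of-thumb fraction."""
    total = Fraction(sum(hours_by_stage.values()))
    if total == 0:
        return []
    comments = []
    for stage, guideline in BROOKS_FRACTIONS.items():
        share = Fraction(hours_by_stage.get(stage, 0)) / total
        if share < guideline - tolerance:
            comments.append(
                f"The total time allocated for {stage} is too low: "
                f"{float(share):.0%} of the project against a guideline "
                f"of {float(guideline):.0%}."
            )
    return comments


if __name__ == "__main__":
    estimate = {
        "Planning": 400,
        "Coding": 400,
        "Component Test and Early System Test": 300,
        "System Testing": 100,
    }
    for comment in review_allocation(estimate):
        print(comment)
```

A real Monitoring System would combine many such rules with the historic maximum, minimum and average values held in the Historical Database.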


8.  Iteration 

Needless to state, several iterations of the
above will occur during the life of a
project. The first estimates will be revised
repeatedly as the project progresses.


3.2 Learning in IPM

The term "learn" is used here in the sense
that the IPM knowledge-base expands to
accommodate additional data and information.
The following examples serve to illustrate
the point.


3.2.1 Adding Structured Information

Information added to the knowledge-base can
be structured or unstructured. When the
information is structured it is possible to
use the information during estimation.

For example, if a project manager acquires
information about a new "faster and user-friendly"
software package for application
prototyping, it would be helpful if the
knowledge-base had access to this new
information. This can be implemented as
follows. Consider this dialog:

Do you want to add new information to the
knowledge base?

>>Yes

To what database must this new information be
added?

>>Software Packages

Please enter the following information

Name of package: 6GL
Price: 650.00
Average Time taken to complete tutorial: 4 Hours
Estimated Productivity Factor (out of 10): 7
Additional Information about the software:

When all the questions have been answered the
above information is inserted into the
database and it will be considered when a new
schedule is generated.
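
The fields collected by this dialog map naturally onto a structured record. The sketch below shows one possible representation; the class and attribute names are hypothetical and are not taken from IPM, and only the values entered in the dialog above are used.

```python
from dataclasses import dataclass, field


@dataclass
class SoftwarePackage:
    """One entry of the Software Packages database (field names assumed)."""
    name: str
    price: float
    tutorial_hours: float
    productivity_factor: int  # out of 10
    notes: str = ""

    def __post_init__(self):
        if not 0 <= self.productivity_factor <= 10:
            raise ValueError("productivity factor must be between 0 and 10")


@dataclass
class KnowledgeBase:
    """Holds structured entries so that they can be used during estimation."""
    software_packages: dict = field(default_factory=dict)

    def add_package(self, package):
        self.software_packages[package.name] = package

    def fastest_to_learn(self):
        """Return the package with the shortest tutorial time."""
        return min(self.software_packages.values(),
                   key=lambda p: p.tutorial_hours)


if __name__ == "__main__":
    kb = KnowledgeBase()
    kb.add_package(SoftwarePackage("6GL", 650.00, 4, 7))
    print(kb.fastest_to_learn())
```

Because every entry has the same typed fields, a scheduler can query them directly, which is what distinguishes structured information from a free-text note.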

3.2.2 Adding Unstructured Information

Unstructured information can also be added to
the knowledge-base.

This is simply a note-pad or a memo field.
Consider the following comments about the
Print Shop:

(On invoking the Learning Module)

Do you have any comments about this
Activity?

>> I have experienced considerable delay in
obtaining printed forms from the Print Shop.
This has considerably delayed the
development process. It would help future
projects if printing requisitioning is done
in the preceding phase.

Such comments can be quite useful to a novice
Project Manager reviewing historical files.


4.0 Conclusion

The automated approach to planning and
estimating will provide benefits only if the
following criteria are satisfied:

1. Projects are consistent in size and scope.

2. Actual historical data are stored in the
Historical Database (i.e., the completed
estimates, not the initial estimates).

3. A sincere attempt is made to document actual
experience (such as the preceding example
about Print Shop delays) in the Monitoring
and Consultation systems.

Finally, it appears that such an automated
system can also be a front end module in the
Computer Assisted Software Engineering
(CASE) system architecture. Integration of
management and development tools is one of
the main goals of the CASE system
architecture.

Abstract

This paper explores a number of different tools for aiding
the medical diagnostic process in the domain of equine
colic. Methods from the machine learning and
knowledge acquisition world are compared to more classical
statistical techniques and reasons are given why these
methods can complement and enhance the diagnosis. A
variation on a weight of evidence formula due to [illegible],
which uses the notion of the probability of a fuzzy event,
is introduced and some initial results are presented.

1.  Introduction 

This paper discusses the progress to date on a large
project undertaken primarily at the University of
Guelph, which is the home to one of Canada's major
(and few) Veterinary Colleges. The larger goal of the
project is to build a general diagnostic [illegible] for
veterinary medicine. Work has begun on a prototype
emphasizing the diagnosis of surgical versus medical colic in
horses. This is a significant problem in veterinary medicine,
having led to many studies, use of diagnostic charts,
etc. to aid owners and veterinarians in recognizing serious
cases [?, 21]. Horses suspected of requiring surgery
must be shipped at a significant cost to the veterinary
hospital where further tests are conducted and a final
decision is made. We are endeavoring to provide a
computerized diagnostic tool for use in the hospital as well as
on a remote access basis for practicing veterinarians.

The animal hospital at Guelph has a computer system called
VMIMS (Veterinary Medical Information Management System).
This system was originally programmed in Sharp APL and
handles usual admission and billing procedures. However,
the system also stores a considerable amount of medical
information on each patient, including bacteriology,
clinical pathology, parasitology, radiology and other
patient information such as

'• '>  i'i"'d  i  I  ‘  '  a  >  i  :i 'j  •  •mpliint  ir'aimeni  pio. 
■'■i'::---  n.oc  '■  '  ■■■■-■  ■■■/'■  hr  flmeal  paile-h-g’. 

n,  ,  h  :  '  :e  ;  ^  i,.  dl-.  e.-h*-r  ii.'d  l-v  th*-  I  tl. 

'  :  ■  ■  ■  1  -1  c  !!■  i,:l\  ledd-.  ..l-.tif  tIOtl 

. . .  SI.;  IS-  ',()()  m-  ^  ; 

' • .  '  •  ; .  -d  o  a  I  j .  1  1,..  ■  i; M *  m  pi.  ,j*--- 1  |.v  1..  11, .• 

I-'  '  1  '  IM  I--'  ■  ’-h'-  Hie'Mtii"-  '  ■!  ••n-liii'-  ‘l.!t.» 

’  ■  '  '  M,  '  1  :  f  L  II  I'  )  I  '  "  (  )l  Ij'  I  d  it  d  i  I-*-  m-.ll*--'  ■  •  -  M 

’  h'  I'll  J  h'M.‘  I,  '  1 1  r  ■[,  o{  oi)  I  a  -  III  ap-  ll  O  -  j 

■  ,  1  '  ,  N!  '  '  d.  lie!  \  I  .1.  /  in  Id  \ 

noil's  't  ?  'u  ;  :  [  '  I  '•  f  •  m  I  -  c  n  »-n  <  i.  J  -i 

!  i  :  I  -  I  M '  ■.  '  ■  m  hi'..-  h. .  i,  sh.i.'i  d-  V  ■  i-  ; 


Guelph, Ontario

seen in major projects like MYCIN. These recent systems
have been largely based on the assumption that to have
expert capability, they must somehow mimic the behavior of
experts. Earlier work using mathematical formalisms
(decision analysis, pattern matching, etc.) was largely
discarded and attention turned to the study of the actual
problem-solving behavior of experienced clinicians. In a
recent paper [21] by Drs. Patil, Szolovits and Schwartz, it
is suggested that the time has come to link the old with
the new: "now that much of the AI community has turned to
causal, pathophysiologic reasoning, it has become apparent
that some of the earlier, discarded strategies may have
important value in enhancing the performance of new
programs". The authors recognize the difficulty of this
approach when they state that "an extensive research effort
is required before all these techniques can be incorporated
into a single program".

We are experimenting with the use of statistical
techniques and at the same time developing a rule-based
system from expert opinions. This paper is primarily
concerned with finding the right tools to analyze the data
before comparing such results with the "expert" rules
component. Section 2.1 discusses the use of discriminant
analysis and logistic regression. Section 2.2 looks at a
method of Bayesian classification and Section 2.3 considers
an inductive learning technique due to the third author and
another method due to R. Quinlan [18], also arising from the
machine learning community. Section 3 introduces a method
using fuzzy sets which also incorporates a weight of
evidence formula due to I. J. Good. In Section 3.4, an
example is given and the effectiveness of the various
methods compared. Finally, we highlight work which is
planned and is currently in progress.

2.1 Classical Statistical Methods

The data used for these studies represented 253 horses
presented to the teaching hospital at Guelph. The horses
were all subjected to the same clinical tests and the same
pathology tests. This data set was used in all the studies
described in this paper. Outcome information of the
following type was available:

vv  h.  ?  .  -'I ;  ^  tv  vv.  I  ••  I.'  I  '•  '1  [■  I'  i  w  h  ••’!;-  1  ■  I  ii'  'I  I  "  a  :  gi- 

'  .i!  1-  O-  W  .e  t  '1  idv  I  O.  l  111  i  :  h'-  till  j1  -^t  tie  ,  ■!  I  h' 

I I  m  I  j  '  I  '■  1  -v  ■  1  '  i  ■  ;  ■  !  '  I '  1  .1  Ill  ■’  1  111'  ■  h  I  1 V  '■  ■  ■! 

ill-  't  idv  w  U-  ■  .  ■  e.  a.'  h  \  It;  it'l'-'-  .  i  t  u:i.  d  hirriL, 

-  X  1 1; ;  If!  ,  ‘  ;  i>  •  it-  h  :  •  '  vv  I '  il  1  i  1  ’  !li  U;  .1  1  l  ..  Wi  v\  •  o  ■ 

-1,-  i!  ii  -  ,  !|  !  ,11  !  -t!'  I  ■■  M  1  ,  I  ,  r  !  .  I  I;  -  I  *!  1 1  :  '  I'l  II  ■  ■  I 
temperature, heart rate, respiratory rate, temperature of
extremities, colour of mucous membranes, capillary refill
times, presence and severity of abdominal pain, abdominal
distension, peristalsis, and the results of naso-gastric
intubation and rectal examination. Clinico-pathological
parameters evaluated included hematocrit (HCT) and the
total plasma concentration of abdominal fluid. These
variables were sometimes continuous and, when descriptive
(pain levels, etc.), were translated into discrete integer
variables. Missing data was handled by elimination of
cases. There are 20 parameters in each of the two data sets
(clin and clin-path).

A multiple stepwise discriminant analysis in a recursive
partition model was used to determine a decision protocol.
The decision was validated by a jackknife classification
and also by evaluation with a referral population in which
the prevalence of surgical patients was 01^7 c f b. 7 . The
significant parameters were found to be abdominal pain,
distension and, to a lesser extent, the color of abdominal
fluid. The use of the decision tree yielded a significant
number of false positives and virtually eliminated false
negatives in one study. Unnecessary surgery is even more
undesirable in animals than humans due to costs (usually
borne by the owner) and the debilitating effects of surgery
on a productive animal. Other difficulties with these
results concerned the fact that the clinical pathology data
appeared entirely non-predictive - a result contrary to the
medical belief that, at least in serious cases, certain of
these measured parameters do change significantly.
Discriminant analysis can miss effects when variables are
not linearly behaved. Missing data was another serious
problem. Other methods described in Section 3 helped
overcome some of these problems.

Logistic Regression [11] was also run on the same data set.
Here, a regression model of the form

$$Y_i = \log \frac{P_i}{1 - P_i} = \beta_0 + \beta_1 X_1 + \cdots + \beta_k X_k$$

is used, where the $\beta_j$'s are slope parameters relating
each of the $X_j$ independent variables to the $Y_i$'s (log
odds ratio). Appropriate transformations were made to
account for nominal data. The data was run on all three
possibilities: surgical lesion found (SL), surgery
performed (S) and outcome (O). The outcome, O, can take on
the values: lived, died or euthanized. Pulse, distension
and a variable representing the presence of firm feces in
the large intestine (A2) were the significant predictors.
However, testing against whether the doctors decided to do
surgery, pain and A2 were the most significant. (These
results were obtained from the clinical data only.) Outcome
found several other variables to be significant: the
probability of death is increased by pain, cold
extremities, a high packed cell volume and low NOO reading
(naso-gastric tube emissions). Again, problems were caused
by the presence of missing data. Some modification of these
results was found when missing data were estimated using
regression

Il.-tli'-ds  '1  Ic  ll-  I'T  t{|<’  Ilils-ltig  dal. I  \\ere  o,|,-,pil- 

at"d  and  vaii'cr-  -  ,ni  -  (  i  n  (--s  r-''''rds  eiel»'d  b«-',,i|-. 
>■!;  f !)..  n  f'la  'A  IS  'h''''ai  t.-  a  -oIuli'Mi  Pt  '  - -st  pea.s.  •n'' 

Thus estimating missing data is not a very reliable
technique.
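
The logistic model used in this section relates a linear combination of the predictors to the log odds of an outcome. The following minimal Python sketch shows that relationship in both directions. The coefficient values and the two-variable case are hypothetical and chosen only for illustration; they are not estimates from the colic data.

```python
import math

def logit(p):
    """Log odds Y = log(p / (1 - p)) for a probability 0 < p < 1."""
    return math.log(p / (1.0 - p))

def predicted_probability(beta0, betas, xs):
    """Invert the logit: P = 1 / (1 + exp(-(beta0 + sum_j beta_j * x_j)))."""
    y = beta0 + sum(b * x for b, x in zip(betas, xs))
    return 1.0 / (1.0 + math.exp(-y))

# Hypothetical intercept and slopes for two discretized predictors
# (for example a pain level and a distension level).
beta0, betas = -2.0, [0.8, 0.5]
case = [3, 1]

p = predicted_probability(beta0, betas, case)
print(round(p, 3))
```

Applying `logit` to the returned probability recovers the linear predictor, which is the sense in which each slope measures a change in log odds per unit change of its variable.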


Although one can change the levels of significance used,
the regression models assume a model of behaviour of the
variables (here, linear). Another problem of the results of
the two analyses presented concerns the reliance on the
measurements of pain and type of abdominal distension.
These variables are extremely subjective and carry a very
large measure of error. Much other medical data of a more
precise nature is collected on the cases. Some of the other
methods to be discussed make more use of these other
measurements in formulating their diagnosis.

2.2  Bayesian  Classification 

This methodology uses Bayes' Theorem to discover an optimal
set of classes for a given set of examples. These classes
can be used to make predictions or give insight into
patterns that occur in a particular domain. Unlike many
clustering techniques, there is no need to specify a
"similarity" or "distance" measure or the number of classes
in advance. The approach here finds the most probable
classification given the data. It allows for both
category-valued information and real-valued information.
For further details on the theory, the
reader  is  referr«‘d  t.,»  l.a.'J'J 

A program called AutoClass I was run on the combined
clinical and pathological data sets. All 51 variables were
included; that is, all outcome possibilities, as described
in Section 2, were included as variables, and lesion type
(four possibilities) was also added. A total of 13 classes
were found and in most cases the probabilities of a horse
belonging to a class were 1.00. The type of information
available consisted of relative influence values for every
attribute to the over-all classification. For each class
and each attribute, influence values are produced
indicating the relative influence of each attribute to that
class. This information is available in tabular and
graphical form.

The classes provide related groups of cases which are
useful for case studies (OVC is also a teaching
institution). The information may be used predictively. For
example, the class with the highest normalized weight,
class 0, was found to have surgical lesion as a very high
influence factor. The variables of abdominal distension,
pulse, abdomen (containing A2 mentioned earlier) and pain,
found significant by earlier methods, were also influential
factors for this class. Some other variables not flagged by
earlier methods were found to be influential as well (total
protein levels and abdominocentesis, in particular). The
horses in this class were found not to have surgical
lesions. It is thus possible to see from the features of
horses in this class which attributes and what type of
attribute values are significant for this to be the case.
New cases can be categorized into classes according to
their attribute values and, if found to be in class 0, this
would indicate a very small chance of surgery being
required. Class 1, however, is predominantly a class where
surgeries are required.

Actually, there is a wealth of information to be gleaned
from the results. Work is ongoing at this time. One may
infer that certain variables are not very predictive. For
example, the variable distinguishing young from old animals
has a low influence value in all the classes but is
slightly more significant in a couple of the smaller
classes. Studying these small classes, however, can be
particularly fascinating because they flag situations which
are more unusual.
That is, methods which simply find variables which are
usually the most predictive cannot perform well on cases
which do not conform to the normal pattern. Other classes
pinpoint cases difficult to diagnose. Class 12 contains
three cases which were all operated on, but only one out of
three was actually found to have a surgical lesion. Two
cases had a simple large colon obturation and the third a
large colon volvulus or torsion (requiring surgery). That
is, two unnecessary surgeries were performed. A close study
of the influential variables and their parameter values for
this class of cases - with very close but importantly
different diagnoses - provides extremely valuable
information.

A significant question arises as to whether or not outcome
information should be included initially in the program
run. If the data were highly predictive, it should not
matter. However, it was not clear from the onset how true
this was. The data was run twice again: once with all
outcome information removed and once with the doctor's
decision information deleted. When all outcome information
was removed, one notes that some interesting prognosis
information still remains. Class 1 indicated cases
virtually all of which lived and whose condition, whether
or not surgery was required, was generally good. The
question often arises whether or not to operate, even given
that the animal has a lesion, if the general prognosis is
bad. This class would indicate that surgery should be
performed in such cases. Class 2 (12 cases) was extremely
well discriminated as having a surgical lesion. Of the
remaining classes (again there
were 13 in total), 3 and 4 were also reasonably well
discriminated on the basis of lesion. Others flagged items
such as young animals or cases that had very poor over-all
prognosis.

New data has been obtained and code written to take new
cases and determine a probability distribution for this
case over the classes, from which a probability of outcome
may be calculated. It is interesting to note that the use
of Bayesian classification for medical diagnosis in this
fashion is in a sense a mathematical model of the mental
process the clinicians themselves use. That is, they try to
think of similar cases and what happened to those cases in
making predictions. The technique for making predictions
outlined above provides a sophisticated automation of this
process. It is impossible within the scope of this paper to
document all the information obtained from using Bayesian
inductive inference. A simple case will be provided in
Section 3.4 which will show the usefulness of combining
information from the results of Bayesian inductive
inference with the other methodologies.
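
The prediction step described above, a probability distribution over classes from which a probability of outcome is calculated, can be sketched as a simple mixture. The Python below is a hedged illustration and not the authors' code: the function name, the class posterior for the new case, and the per-class rates for classes 0 and 1 are all assumed. Only the one-in-three rate for class 12 follows the counts given in the text.

```python
def outcome_probability(class_posterior, outcome_given_class):
    """P(outcome | case) = sum over classes c of P(c | case) * P(outcome | c)."""
    total = sum(class_posterior.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError("class posterior must sum to 1")
    return sum(p * outcome_given_class[c] for c, p in class_posterior.items())

# Hypothetical posterior over classes for one new case.
class_posterior = {0: 0.70, 1: 0.25, 12: 0.05}

# Fraction of training cases in each class with a surgical lesion.
# Classes 0 and 1 use assumed rates; class 12 had one lesion in three cases.
lesion_given_class = {0: 0.05, 1: 0.90, 12: 1.0 / 3.0}

p_lesion = outcome_probability(class_posterior, lesion_given_class)
print(round(p_lesion, 4))
```

A case assigned to a single class with probability 1.00, as happened for most horses here, simply inherits that class's outcome rate.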

2.3 Other Induction Techniques

The probabilistic learning system developed by the third
author also classifies data. Classes are discriminated
according to an inductive criterion which is essentially
information-theoretic. To accommodate dynamic and uncertain
learning, PLS1

|ej,|.  .,|||,  o,m,.|,i-  I.., Ill  ,|.S  pj.  I,.|',|.  ,i|.|  ti\|.eir..-. 
lli  ,|  II-  1  I-. -■  MlO-ri-'ll  He  -  -I  p- -rate  b- -1  tl  pr-t-.d-illU  .in-l  ll- 

-'I !  -  .1  ')'- 1  I  i!if  ti '  1 II  !  nil-  I-  nunc  -  i  t-----  -  -I  pi  -  -  I- 1 1  il 
Two classes of data were given to the program (1: no
surgery, 2: surgery) and 14 clinical pathology variables
were processed according to the PLS1 algorithm. The results
can be interpreted as rules and also flag the most
prominent variables. The current version of PLS1 required
that the data be scaled between 0 and 255. The variable X1
represents total cell numbers, X8 is mesothelial cells and
X13 is inflammation. These were found to be the most
significant variables for the prediction of no surgery. For
the purpose of predicting surgery, again X1 and X13 were
significant, as well as X9, a measure of degenerate cells.
Uncertain rules can also be obtained from the results.
Further work is being done to revise the learning algorithm
to be able to handle missing data without losing entire
cases (i.e., all parameter values when only one is absent).
The statistical methods lose all case information and it
would be a considerable advantage to employ a method less
sensitive to this problem.

This technique produces diagnostic rules of the form:

Class 1.

[1 ≤ x1 ≤ 1] [0 ≤ x8 ≤ 26] [0 ≤ x13 ≤ 0], with a utility value of .8333

Class 2.

[2 ≤ x1 ≤ 255] [1 ≤ x9 ≤ 255] [1 ≤ x13 ≤ 255], with a utility value of .8462

Actually, for each class there are several rules with
different utility functions. A very low utility function
value would indicate more probable membership in the other
class. For the example given, not abnormal values of x1 and
x8 combined with no inflammation is indicative of a medical
colic at level .83, whereas a higher cell count (more
abnormal), together with some inflammation and the presence
of degenerate cells, indicates a surgical case at a level
of .85. Keeping aO cases aside for test purposes, the
performance of this system for outcome (lived, died or
euthanized) was 7 I' r correct diagnoses. For the
prediction of lesion or no lesion the result was til'/.
However, the initial training set was not particularly
large when missing data was taken into account, and we have
now just completed the collection of new compatible data
which is being tested. The initial performance for outcome
was better than the doctors' predictions and for lesion or
no lesion was lower (by about 8%).
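
Rules of the interval form shown above can be applied to a new case by checking each variable against its range. The Python below is a minimal sketch of that idea and is not the PLS1 implementation. The `Rule` class and `classify` function are hypothetical names, the bounds are treated as inclusive by assumption, and the x8 condition is left out of the first rule because its upper bound is uncertain in the source.

```python
from dataclasses import dataclass

@dataclass
class Rule:
    label: str
    bounds: dict      # variable name -> (low, high), both inclusive (assumed)
    utility: float

    def matches(self, case):
        """True when every bounded variable of the case lies in its interval."""
        return all(lo <= case[v] <= hi for v, (lo, hi) in self.bounds.items())

def classify(case, rules):
    """Return the matching rule with the highest utility, or None."""
    matching = [r for r in rules if r.matches(case)]
    if not matching:
        return None
    return max(matching, key=lambda r: r.utility)

# Illustrative rules on data scaled to 0..255.
rules = [
    Rule("no surgery", {"x1": (1, 1), "x13": (0, 0)}, 0.8333),
    Rule("surgery", {"x1": (2, 255), "x9": (1, 255), "x13": (1, 255)}, 0.8462),
]

case = {"x1": 40, "x8": 12, "x9": 3, "x13": 2}
best = classify(case, rules)
print(best.label if best else "no rule applies")
```

Returning `None` when no rule fires keeps unusual cases visible instead of forcing them into one of the two classes.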

Quinlan's algorithm [18] for rule extraction has been run
on the clinical data. Continuous data has been converted to
discrete values. The important variables are mucous
membranes, peristalsis, rectal temperature, packed cell
volume, pulse, abdominal distension and naso-gastric
reflux. A number of variables appeared significant using
this technique which were deemed unimportant using
discriminant analysis. A decision tree was generated but

■  iiH-  i,.i|-!'-nii-ii'  ii  I- -Il  al  lillii  nliii--  dill-  lo  iIh-  ihmiiI'iT  --I 
.11  I  |||||  -  ||-|-||  I  II- 1  11,1— I  al  1  h -  I  I  -  -  a  I  I--  w  li  l-  li  w  a.- 
■i-  I  -.  I-I  \  ■  -III  I  |.  I  I  I  1 1  1 1  a  I  !-  I  Ih-'-i-  I  I  -  .|,|i-|n-  HI- 

.  Ii;  1  ■  Il  I  !  ■  H-V  I-Mo  --  I  I  -  ■  .  i-  I '  1-  -p  .1  111-  1--  I  -  -l-ii-t  i-r- 

-I,  ■  -I  I  b-  lU-  I  H  Ion  lb-  iii-ih  -linlii'  I  'll'-wiiii:-''- 

I  ■■  .[i  11  -I  -  ■  1 II  V  1 1 1  1 1  -1-  -  I  :  I  I  - !  ■  - !  iiiti-'tibiiii''-iii-iMu'b’' 

-,  1 1 .  I  i  - 1  i!  ■ :  I  I  -  ■■■  ■  1  '  .  b  . n  --■  -  -  -n  -  -Il  1\  I  !  I  w 

1  1 1  1 1 :  Hill- 
3.1 Overview

This section describes a method of evidence combination
which performs Bayesian updating using evidence that may be
best modelled using an infinite valued logic such as that
which fuzzy set theory provides. The methodology described
in this section provides a unified approach for intelligent
reasoning in domains that include probabilistic uncertainty
as well as interpretive or "fuzzy" uncertainty.

A formulation central to several components of the
methodology is that of the "weight of evidence" and is
therefore introduced in Section 3.2. The description (and
justification) of the use of an infinite valued logic is
presented in Section 3.3, which also explains how
"important" symptom sets are used. Section 3.4 relates the
performance of this method in the domain of equine colic
diagnoses.


3.2  The  Weight  of  Evidence 

A. M. Turing originally developed a formulation for what he
called the "weight of evidence provided by the evidence E
towards the hypothesis H", or W(H:E). Good [8, 9] has
subsequently investigated many of the properties and uses
of Turing's formulation, which is expressed as:

$$W(H:E) = \log \frac{p(E \mid H)}{p(E \mid \bar{H})} \quad \text{or} \quad W(H:E) = \log \frac{O(H \mid E)}{O(H)}$$

where $O(H)$ represents the odds of H, $O(H) = p(H) / p(\bar{H})$.
Weight of evidence plays the following part in Bayesian
inference:


For these values the physician may say that the
temperature is "sort of normal" or "sort of normal but also
sort of high". This is a different and separate concept
from the probability of an event (using either the belief
or likelihood interpretations). Implicitly, probability
theory (in both interpretations) assumes that an event
either happens or does not (is true or false). On the
practical side, we have found that the concept and
estimation of membership functions is intuitively easy for
physicians.

Let F be a fuzzy subset of a universe, U. F is a set of
pairs $\{x, \mu_F(x);\ x \in U\}$ where $\mu_F(x)$ takes a
value in [0,1]. This value is called the grade of
membership of x in F and is a measure of the level of truth
of the statement "x is a member of the set F".

A strong α-level subset, $A_\alpha$, of F is a fuzzy set
whose elements must have a grade of membership of at least
α. Formally defined,

$$A_\alpha = \{x, \mu_F(x);\ \mu_F(x) \ge \alpha\}$$

For example, if we have the fuzzy set F = {x1|0.2, x2|0.7,
x3|0.0, x4|0.4} then the strong α-level set is
{x1|0.2, x2|0.7, x4|0.4}.
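
A short Python sketch may make the definitions concrete. It assumes that a strong α-level subset keeps every element whose grade is at least α, and it uses α = 0.2 for the example set; both choices are assumptions, since the threshold is not stated explicitly. The piecewise-linear membership function for a "normal" temperature and its breakpoints are hypothetical and only illustrate what a physician-supplied membership function might look like.

```python
def alpha_level_subset(fuzzy_set, alpha):
    """Keep the elements (with their grades) whose membership is >= alpha."""
    return {x: mu for x, mu in fuzzy_set.items() if mu >= alpha}

def normal_temperature(t, low=37.0, core_low=37.5, core_high=38.5, high=39.0):
    """Hypothetical grade of membership of t (deg C) in 'normal temperature'."""
    if t <= low or t >= high:
        return 0.0
    if core_low <= t <= core_high:
        return 1.0
    if t < core_low:
        return (t - low) / (core_low - low)
    return (high - t) / (high - core_high)

F = {"x1": 0.2, "x2": 0.7, "x3": 0.0, "x4": 0.4}
print(alpha_level_subset(F, 0.2))
print(normal_temperature(38.75))
```

A reading of 38.75 receives a grade between 0 and 1, which corresponds to the "sort of normal but also sort of high" judgement mentioned in this section.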

Because we wish to perform probabilistic inference we need
to have a means of calculating the probability of fuzzy
events. Two methods have been suggested for this, the first
from Zadeh [26] and the second from Yager [25]. Zadeh's
formulation is as follows:

$$P(A) = \int \mu_A(x)\, dP = E[\mu_A]$$


Prior log odds + weight of evidence = posterior log odds

A weight of evidence which is highly negative implies that
there is significant reason to believe in $\bar{H}$, while
a positive W(H:E) supports H. This formulation has been
most notably used in a decision support system called
GLADYS developed by Spiegelhalter [23].
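
The relation between prior log odds, weight of evidence and posterior log odds can be checked numerically. The Python sketch below uses natural logarithms and made-up probabilities; the function names are illustrative and not taken from any system mentioned in this paper.

```python
import math

def weight_of_evidence(p_e_given_h, p_e_given_not_h):
    """W(H:E) = log( p(E|H) / p(E|not H) )."""
    return math.log(p_e_given_h / p_e_given_not_h)

def log_odds(p):
    """log O(H) with O(H) = p(H) / p(not H)."""
    return math.log(p / (1.0 - p))

def posterior_probability(prior, woe):
    """Add the weight of evidence to the prior log odds, then convert back."""
    post = log_odds(prior) + woe
    return 1.0 / (1.0 + math.exp(-post))

# Hypothetical numbers: prior p(H) = 0.3, p(E|H) = 0.8, p(E|not H) = 0.2.
w = weight_of_evidence(0.8, 0.2)
posterior = posterior_probability(0.3, w)
print(round(w, 3), round(posterior, 3))
```

The posterior agrees with a direct application of Bayes' theorem, 0.3 * 0.8 / (0.3 * 0.8 + 0.7 * 0.2), and a positive weight moves the probability of H upward as the text states.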

In any formulation for evidence combination using higher
order joint probabilities there exists the problem of
evidence that may appear in many different ways. For
example, a patient has the following important symptom
groups: (High pain), (High pain, high temp), (High pain,
high temp, high pulse). Which of these symptom groups
should be used? Using more than one would obviously be
counting the evidence a number of times. The rule we have
chosen to resolve this situation is to choose the symptom
group based on a combination of the group's size, weight,
and error. In this way we may balance these factors
depending upon their importance in the domain. For example,
if higher order dependency is not prevalent in a domain
then the size of a group is of little importance.
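
The text does not give the exact way size, weight and error are combined, so the Python below is only one possible sketch. The linear score, its two tuning parameters, and the weights and errors attached to each group are all assumptions made for illustration.

```python
def choose_group(candidates, size_weight=0.0, error_weight=1.0):
    """Pick one symptom group so that evidence is counted only once.

    Each candidate is a dict with 'symptoms' (tuple), 'weight' (weight of
    evidence) and 'error' (an uncertainty estimate for that weight).
    The score below is a hypothetical combination, not the authors' rule.
    """
    def score(group):
        return (abs(group["weight"])
                + size_weight * len(group["symptoms"])
                - error_weight * group["error"])
    return max(candidates, key=score)

candidates = [
    {"symptoms": ("high pain",), "weight": 0.9, "error": 0.2},
    {"symptoms": ("high pain", "high temp"), "weight": 1.4, "error": 0.4},
    {"symptoms": ("high pain", "high temp", "high pulse"),
     "weight": 1.6, "error": 0.9},
]

print(choose_group(candidates)["symptoms"])
print(choose_group(candidates, size_weight=0.5)["symptoms"])
```

Setting `size_weight` to zero corresponds to a domain where higher order dependency is rare, so group size carries little importance; raising it favours larger groups.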

3.3 Events as Strong α-Level Subsets

Infinite valued logic (IVL) is based on the belief that
logical propositions are not necessarily just true or false
but may fall anywhere in [0,1]. Fuzzy set theory is one
common IVL which provides a means of representing the truth
of a subjective or interpretive statement. For example,
what a physician considers to be a "normal" temperature may
be unsure or "fuzzy" for certain values.

where $\mu_A$ is the membership function of the fuzzy set
A, and $\mu_A(x) \in [0,1]$. Yager argues that "it appears
unnatural for the probability of a fuzzy subset to be a
number". We would further argue that Zadeh's formulation
does not truly provide a probability of a fuzzy event but
something quite different: the expected truth value of a
fuzzy event. Yager proposes that the probability of a fuzzy
event be a fuzzy subset (fuzzy probability):


$$\tilde{P}(A) = \bigcup_{\alpha \in [0,1]} \alpha / P(A_\alpha)$$

where $A_\alpha$ specifies the α-level subset of A and,
since $P(A_\alpha) \in [0,1]$, $\tilde{P}(A)$ is a fuzzy
subset of [0,1]. This fuzzy subset then provides a
probability of A for every α-level subset of A. Thus,
depending on the required (or desired) degree of
satisfaction, a probability of the fuzzy event is
available. In our case the desired level of truth is that
which maximizes the bias of this event to the hypothesis.
For example, if we wish to set a degree of satisfaction for
the proposition "x is tall" and we are primarily interested
in whether x is a basketball player, then we wish to choose
an α-level which allows us to best differentiate BB players
from non-BB players. We define this optimal α-level to be

$$\max_{\alpha \in [0,1]} \left| W(H : A_\alpha) \right|$$

where $W(H : A_\alpha)$ is the weight of evidence of the
strong α-level subset $A_\alpha$ provided towards the
hypothesis H. The α-level which maximizes the bias of a
fuzzy event to a hypothesis (or null hypothesis) is the
optimal α-level for minimizing systematic noise in the
event.
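
The choice of α can be sketched as a search over candidate thresholds. In the Python below, each case carries a membership grade for a fuzzy event such as "x is tall" and a flag saying whether the hypothesis (being a basketball player) holds. The data, the additive smoothing used to avoid zero counts, and the restriction of candidates to observed grades are all assumptions of this sketch. Zadeh's expected membership is included for comparison.

```python
import math

def zadeh_probability(grades):
    """Expected membership over equally weighted cases (Zadeh's P(A))."""
    return sum(grades) / len(grades)

def weight_of_evidence_at(alpha, cases, smooth=0.5):
    """W(H : A_alpha), where A_alpha occurs for a case when grade >= alpha."""
    h = [g for g, is_h in cases if is_h]
    not_h = [g for g, is_h in cases if not is_h]
    p_h = (sum(g >= alpha for g in h) + smooth) / (len(h) + 2 * smooth)
    p_not = (sum(g >= alpha for g in not_h) + smooth) / (len(not_h) + 2 * smooth)
    return math.log(p_h / p_not)

def optimal_alpha(cases):
    """Candidate alpha with the largest absolute weight of evidence."""
    candidates = sorted({g for g, _ in cases if g > 0})
    return max(candidates, key=lambda a: abs(weight_of_evidence_at(a, cases)))

# Hypothetical (grade of "tall", is a basketball player) pairs.
cases = [(0.9, True), (0.8, True), (0.7, True), (0.4, True),
         (0.6, False), (0.3, False), (0.2, False), (0.1, False)]

best_alpha = optimal_alpha(cases)
print(best_alpha, round(weight_of_evidence_at(best_alpha, cases), 3))
print(zadeh_probability([g for g, _ in cases]))
```

Using the absolute value means a threshold that strongly supports the null hypothesis is as acceptable as one that supports H, which matches the "(or null hypothesis)" wording above.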

The identification of important sets of symptoms or
characteristics is done commonly by human medical experts
and other professionals. For example, the combination of
(abdominal pain, vomiting, fever) may indicate appendicitis
with a certain probability or level of confidence. Our
motivation for trying to discover important symptom groups
is twofold: to identify which groups are important in a
predictive sense and to quantify how important a group is.
Also a factor in the decision of using symptom groups
instead of individual variables is the belief that there
exist many high order dependencies in this and other
real-life domains. A symptom set may be of any size between
1 and N, where N is the number of possible symptoms. To
find all such symptom sets requires an exhaustive search of
high combinatorial complexity. This may be reduced somewhat
by not examining groups that contain a subset of symptoms
which are very rare. For example, if p(High pain, low
pulse) is very low then we need not look at any groups
containing these two symptoms. Our present implementation
examines sets up to size three. The weight of evidence of
each symptom group is measured from the data and a test of
significance decides whether this group has a weight
significantly different from 0. Of significant interest is
the clinician's endorsement of the important symptom sets
that this method found. Those sets which showed as being
important using the weight of evidence are symptom groups
that the clinician would also deem as being significant.
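
The search described above can be sketched in Python as an enumeration of symptom groups up to size three that skips any group containing an already-rare subset. This is an illustration under stated assumptions and not the authors' C and Pascal implementation: the toy cases, the support threshold and the additive smoothing are hypothetical, and the significance test mentioned in the text is omitted.

```python
from itertools import combinations
import math

def find_symptom_groups(cases, max_size=3, min_support=0.05, smooth=0.5):
    """Return {group: weight of evidence towards a surgical lesion}.

    cases is a list of (set_of_symptoms, has_lesion) pairs.
    """
    n = len(cases)
    symptoms = sorted(set().union(*(s for s, _ in cases)))
    n_h = sum(1 for _, h in cases if h)
    n_not = n - n_h
    rare = set()       # groups too rare to be worth extending
    results = {}
    for size in range(1, max_size + 1):
        for group in combinations(symptoms, size):
            gset = frozenset(group)
            if any(r <= gset for r in rare):
                continue   # contains a rare subset, so it is rare too
            with_h = sum(1 for s, h in cases if h and gset <= s)
            with_not = sum(1 for s, h in cases if not h and gset <= s)
            if (with_h + with_not) / n < min_support:
                rare.add(gset)
                continue
            p_h = (with_h + smooth) / (n_h + 2 * smooth)
            p_not = (with_not + smooth) / (n_not + 2 * smooth)
            results[group] = math.log(p_h / p_not)
    return results

cases = [
    ({"high pain", "high temp", "high pulse"}, True),
    ({"high pain", "high temp"}, True),
    ({"high pain", "high pulse"}, True),
    ({"high pain"}, False),
    ({"high temp"}, False),
    ({"low pulse"}, False),
    ({"high temp", "low pulse"}, False),
    ({"high pulse"}, False),
]

groups = find_symptom_groups(cases, min_support=0.2)
for group, w in sorted(groups.items(), key=lambda kv: -abs(kv[1])):
    print(group, round(w, 3))
```

Because (high pain, low pulse) never occurs in the toy data, no larger group containing that pair is examined, which is the pruning the text describes.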

3.4 Implementation and Results

Implementation was on a Sequent parallel processor with
Intel 80386's. The method was coded in C and Pascal and
made much use of a programming interface to the ORACLE
RDBMS. This provided a powerful blend of procedural and
non-procedural languages in a parallel programming
environment.

Data was obtained for a training set of 253 equine colic
cases, each composed of 20 clinical variables. Also
included for each case are several pertinent diagnostic
codes: clinician's decision, presence of a surgical lesion,
and lesion type. The prototype system provides a prediction
for the presence of surgical lesions. Veterinary experts
commonly have problems in differentiating between surgical
and non-surgical lesions. Of primary concern to the
clinicians is the negative predictive value, that is, how
often a surgical lesion is properly diagnosed. If a
surgical lesion is present and is incorrectly diagnosed
then the lesion is usually fatal for the horse. Presented
below is a summary of our results using cases from the
training set.


From these results we can see that the method of
evidence combination achieved an accuracy for negative
prediction which exceeded the clinician's. Incorrect
diagnoses are being reviewed by the clinician to see if some
explanation can be found. There seems to be no
correlation between the clinician's errors and those of the
computer techniques. This perhaps indicates that the clinician is
adept at cases which are difficult for our techniques (and
vice versa).

The following example shows our results for a case
which had a surgical lesion but was not operated on by
the clinicians. The horse displayed the following
symptoms:

> 6 months old
High rectal temp
Very high pulse
High resp rate
Cool temp at extremities
Reduced peripheral pulse
Normal mucous membranes
Capillary refill (value illegible)
Depressed
Hypomotile
Mod abdom distension
Sl nasogastric reflux
No reflux
Reflux pH
Normal rectal exam
Distended lg intestine
Norm packed cell vol
Normal total protein
Serosang centesis
High abd tot protein

and the method determined that the following evidence
was important:

Evidence Towards Surgical Lesion:

Symptom Group                                        WOE
Adult, Hypomotile, Mod abdom dist                    [illegible]
Sl nasogast, Dist L.I., Norm Tot Pro.                [illegible]
V high pulse, Red per pulse, Cap refill              [illegible]
Cool temp Xtrem, No reflux, Normal [illegible]       [illegible]
High rect temp, Depressed                            [illegible]
Serosang abdom centesis [illegible]                  [illegible]

Evidence Against Lesion:

H resp rate, Norm mucous mem, Norm rect exam         [illegible]

Final Results:

Prior Log Odds  = [illegible]
WOE             = [illegible]
Post Odds       = [illegible]
p(surg lesion)  = [illegible]
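The "Final Results" block combines a prior with the total weight of evidence. A minimal sketch of that bookkeeping follows, assuming the standard log-odds form (posterior log odds = prior log odds + total weight of evidence, natural logarithms). The paper's exact formula is not shown in this section, so the function name, the subtraction of the weights against the lesion, and the numbers are assumptions made for illustration.

```python
import math


def posterior_from_woe(prior_log_odds, weights_for, weights_against):
    """Combine a prior with weights of evidence on the log-odds scale.

    Assumes weights add: those supporting the hypothesis count
    positively, those against it count negatively.
    Returns (total weight, posterior odds, posterior probability).
    """
    total_woe = sum(weights_for) - sum(weights_against)
    post_log_odds = prior_log_odds + total_woe
    odds = math.exp(post_log_odds)
    return total_woe, odds, odds / (1.0 + odds)


# Made-up inputs: even prior odds, one group worth log(3) in favour.
woe, odds, p = posterior_from_woe(0.0, [math.log(3)], [])
```

With even prior odds and a single supporting weight of log 3, the posterior odds are 3 to 1, so the probability is 0.75.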


Comparison of Predictive Power

Method               Negative Predictive Value   Positive Predictive Value
Clinician*           87.6%                       100%
Weight of Evidence   96.7%                       [illegible]

* Figures obtained from these cases. Previous studies
have shown these values to be [illegible] and [illegible]
respectively over a large sample.

For this case, this method strongly supports the
surgical lesion hypothesis. It is interesting to compare
these results to those of the classical regression model.
Using this model the probability of a lesion, p, is
predicted by

$p = \dfrac{1}{1 + e^{Y}}$, where

$Y = 7.80 - 1.73(A2) - 1.51\ln(\mathit{pulse}) - 0.108(\mathit{distension})$.

In this case A2 was 1 because the horse had a distended
large intestine, the pulse was 114, and the distension was
3 (moderate). Substituting into the formula we get

$Y = 7.80 - 1.73 - 1.51\ln(114) - 0.108(3)$

and when we solve for p we find that p = 0.5818.
Thus the classical regression analysis produces a p value
greater than .5, but one which is not strongly conclusive.

It is interesting to combine these findings with the
results of Bayesian classification. This case belongs to
autoclass 1 (determined with outcome information
excluded). In this class, 86.3% of the 22 cases had a
lesion, close to the value predicted by the weight of
evidence formula. However, in this class only 10% of the
cases lived and only 15% of the animals which had a
lesion and were operated on actually lived. Pathology
variables were particularly important for determining
that class and abdominal distension was only moderately
significant (although the doctors and logistic regression
rely on this variable). The significant variables found by
the BLSl algorithm were also of very high weight for
this class. (One measurement of a pathology variable
necessary for making a diagnosis was missing. The other
factors indicated a slight preference for the presence of a
lesion.) The Autoclass information suggests a poor outcome
prognosis in any case and indeed this was a situation
in which the clinicians decided on euthanasia.
Regression and weight of evidence techniques alone
would not have suggested this decision.
4.0 Conclusions and Further Work

This paper has attempted to provide an idea of the
methods being used to extract information from data in
the development of an information system for medical
diagnoses. Several techniques have been presented and
some initial comparisons made.

Some tests of performance were accomplished by
keeping aside portions of the test data. A new data set
is currently being gathered with cases which will be
used to both test the methodologies and then refine the
present diagnostic results. This new data set has been
difficult to retrieve as not all the data used in the
original set was on-line. We are taking pains to ensure
that the information taken from hard copy is entirely
consistent with the first training set.

We are looking at a more standard Bayesian
model and trying to understand the dependencies and
[illegible] occurring in the data. The methodologies used
here also help shed some light on this. The works by
[illegible] and Spiegelhalter [10] are being considered for
this approach.

In terms of developing an actual system using the
methodologies of part 4, a prototype has already been
implemented which considers symptom groups up to
size three. A more advanced algorithm which
[illegible] between groups and provides an error
estimate is currently being implemented in a blackboard
architecture.

Terminals exist in the hospital in the work areas
used by the clinicians and we are now proceeding to
make the results available on incoming cases. Any final
system would provide the doctors with selected
information from several methodologies. This is to help
especially with the diagnosis of difficult cases, as the real
question is not just to be statistically accurate a certain
percentage of the time, but to provide diagnostic aids for
the harder cases.

ABSTRACT


This  paper  presents  a  comparative 
study  of  six  major,  leading  methods  for 
reasoning:  (1)  Bayes'  Rule,  (2) 
Dempster-Shafer  theory,  (3)  Fuzzy  Set 
Theory,  (4)  MYCIN  Model,  (5)  Cohen's 
System  of  inductive  probabilities,  and 
(6)  a  class  of  Non-monotonic  reasoning 
methods.  Each  method  is  presented  and 
discussed  in  terms  of  theoretical 
content,  a  detailed  numerical  example, 
and a list of strengths and limitations.
Purposely,  the  same  numerical  example  is 
addressed  by  each  method  so  that  we  are 
able  to  highlight  the  assumptions  and 
computational  requirements  that  are 
specific  to  each  method  in  a  consistent 
manner.  Guidelines  are  offered  to  assist 
in  the  selection  of  the  method  that  is 
most  appropriate  for  a  particular 
problem.

KEY WORDS: Inference models, expert
systems,  imperfect  knowledge, 
uncertainty,  decision  support  systems, 
inference  network,  evidential  reasoning. 

1.  INTRODUCTION 

Intelligent  systems  that  support 
human  judgement  and  choice  are  computer- 
implemented  procedures  that  seek  to 
combine  knowledge  about  a  domain  (e.g., 
problem  or  situation)  with  methods  of 
conceptualizing,  structuring  and 
reasoning  about  such  a  domain.  They 
incorporate,  additionally,  "formal" 
methods  of  reasoning  about  the  domain 
that  need  to  be  brought  to  bear  when 
tasks  are  poorly  understood  and 
structured,  or  when  the  information 
available  is  incomplete,  fragmented,  or 
otherwise  imperfect.  The  ability  of 
such  computer-based  expert  systems  to  be 
able  to  look  at  part  of  the  "picture" 
available  and  then  make  inferences  about 
the  true  nature  of  the  problem  rests 
upon  a  knowledge  base  that  is  able  to 
combine  pieces  of  information  available 
and  utilize  appropriate  reasoning 
methods.  Such  reasoning  methods  include 
the  heuristics  or  informal  "rules  of 
thumb"  that  people  use  to  rapidly  find 
solutions  to  problems,  as  well  as  formal 


reasoning  methods  that  are  useful  in 
resolving  problems  about  which 
experiential  familiarity  is  slight. 

A  major  motivation  for  this  paper  is 
the  need  to  assess  the  progress  in  the 
development  of  methods  that  utilize 
imperfect  information,  and  algorithms 
that  offer  the  potential  for  utilization 
in  computer-based  expert  systems.  Many 
papers  in  the  literature  discuss  one 
method,  but  few  consider  in  a 
comparative  manner  two  or  more  methods. 
As  a  result,  the  reader  must  deal  with 
imperfect  information  about  the 
capabilities,  applicability  and 
limitations  of  each  method. 

This  paper  borrows  from  other  works 
in  many  respects.  One  of  the  numerical 
examples  used  throughout  this  paper,  for 
instance,  is  originally  due  to  Lee, 
Grize  and  Dehnad  (1987),  who  have 
demonstrated  four  of  the  methods, 
specifically:  (1)  Bayes'  Rule,  (2) 
Dempster-Shafer,  (3)  Fuzzy  Set  theory, 
and (4) the MYCIN model. This paper
presents their treatment of the first
three methods, expands on the
description of and example for the
MYCIN model, and constructs two new
examples to illustrate (5) Cohen's
System of inductive probabilities and
(6) a class of Non-monotonic reasoning
methods. During the construction of the
example for the non-monotonic reasoning
method, valuable insight into the method
was provided by the previous work of
Cohen, Watson and Barret (1985), who
present a realistic application to image
analysis. Also, a comparative analysis
by Black and Eddy (1985) has helped
greatly in the discussion of the
strengths and limitations of each
method.

Interested  readers  are  invited  to 
write  to  the  author  for  a  complete 
version  of  this  paper. 

Consider the following diagnostic
problem due to Lee et al. (1987):

2. BAYES' RULE

The problem can be restated as
follows:

IF:   a person X has a runny nose, and
      X has irritated eyes,

THEN: conclude that X has only a common
      cold with probability p1, AND
      conclude that X has only a
      nostril allergy with probability
      p2, AND conclude that X has
      neither a cold nor an allergy
      with probability p3, AND conclude
      that X has both a cold and an
      allergy with probability p4.

Also, we let the evidence be

E:  X has a runny nose and irritated eyes,

and let the set of hypotheses be

H1: X has only a common cold,
H2: X has only a nostril allergy,
H3: X has neither a cold nor an allergy, and
H4: X has both a cold and an allergy.

By Bayes' Rule we have that

$p_i = P(H_i \mid E) = P(E \mid H_i)\,P(H_i) / C$

where

$C = P(E \mid H_1)P(H_1) + P(E \mid H_2)P(H_2) + P(E \mid H_3)P(H_3) + P(E \mid H_4)P(H_4).$
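The computation can be sketched directly from the formula. The priors and likelihoods below are made-up numbers chosen only to show the mechanics; they are not taken from Lee et al. or from this paper, and the function name `bayes_posteriors` is likewise an assumption.

```python
def bayes_posteriors(priors, likelihoods):
    """Return P(H_i | E) for mutually exclusive, exhaustive hypotheses.

    priors[i] is P(H_i); likelihoods[i] is P(E | H_i).
    """
    joint = [p * l for p, l in zip(priors, likelihoods)]
    c = sum(joint)  # the normalizing constant C = P(E)
    if c == 0:
        raise ValueError("evidence has zero probability under every hypothesis")
    return [j / c for j in joint]


# Hypothetical inputs for H1..H4 (cold only, allergy only, neither, both).
priors = [0.30, 0.10, 0.55, 0.05]
likelihoods = [0.40, 0.60, 0.02, 0.80]
posteriors = bayes_posteriors(priors, likelihoods)
```

Because the four hypotheses are disjoint and exhaustive, the posteriors sum to one. Adding or removing a hypothesis changes C, so every posterior has to be recomputed.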

Strengths and Limitations. There
are a number of significant drawbacks in
applying Bayes' rule to expert systems:

(1) the rule requires all the
hypotheses to be disjoint and, in a
large expert system, dividing a solution
space into mutually exclusive subsets
may be expensive;

(2) in the event of altering the
probability of an event in a system (by
adding or removing hypotheses) we would
need to recalculate all the
probabilities;

(3) there is no guarantee that the
set of probabilities built into an
expert system is consistent and
coherent; for example, the product
P(A|B)P(B) may or may not be equal to
P(B|A)P(A);

(4) in realistic situations
evidentiary information can quickly
translate into very long sums and
products of conditional and marginal
distributions requiring substantial
storage and computing resources.


3.  THE  DEMPSTER-SHAFER  THEORY  OF 
EVIDENCE 

This theory of mathematical
evidence (Shafer, 1976; Dempster, 1967)
is basically a set-theoretic
generalization of Bayesian theory.


There  are  some  problems  in  the 
theory  that  are  yet  to  be  addressed  in 
greater  detail: 

(1) Dempster's rule of combination
cannot be applied in situations where
there are considerable disagreements
among the evidence, that is, when the
cores of two belief functions are
disjoint;

(2) in realistic cases a long chain of
inferences may make the theory very
inconvenient and expensive to use
because of the increasing complexity in
the structure of the core of the belief
functions;

(3) the numerical stability of the
theory needs to be analyzed further; in
some cases, small variations in the
basic probability assignments can
produce a large variation in the
results.
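Drawback (1) is easy to see in code. The sketch below implements Dempster's rule of combination for two mass functions stored as dictionaries keyed by frozensets. The representation, the function name `combine`, and the example masses are assumptions made for illustration and are not taken from the paper.

```python
from itertools import product


def combine(m1, m2):
    """Dempster's rule for two mass functions given as {frozenset: mass}.

    Returns (combined mass function, conflict), where conflict is the
    total mass that the two sources commit to the empty set.
    """
    raw = {}
    conflict = 0.0
    for (a, wa), (b, wb) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            raw[inter] = raw.get(inter, 0.0) + wa * wb
        else:
            conflict += wa * wb
    if 1.0 - conflict < 1e-12:
        # Disjoint cores: nothing is left to normalize.
        raise ValueError("total conflict: Dempster's rule is undefined")
    return {s: w / (1.0 - conflict) for s, w in raw.items()}, conflict


# Hypothetical masses over the four diagnostic hypotheses.
frame = frozenset({"cold", "allergy", "neither", "both"})
m_nose = {frozenset({"cold", "both"}): 0.6, frame: 0.4}
m_eyes = {frozenset({"allergy", "both"}): 0.5, frame: 0.5}
combined, conflict = combine(m_nose, m_eyes)
```

When one source puts all of its mass on "cold" and the other all of its mass on "allergy", every intersection is empty and the function raises an error instead of returning a belief function.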

4. VAGUENESS IN FUZZY SET THEORY

In contrast to probability and
evidence theory as models for
representing uncertainty, a theory of
possibility was proposed by Zadeh (1978)
to represent vagueness inherent in some
linguistic terms.

Our problem is decomposed into two
rules:

Rule 1: IF a person X definitely has a
        runny nose,
        AND X definitely has irritated
        eyes,
        THEN X probably has a common
        cold;

Rule 2: IF a person X definitely has a
        runny nose,
        AND X definitely has irritated
        eyes,
        THEN X may or may not have a
        nostril allergy.

These two rules make use of the term
set T( ):

T( ) = [definitely not, probably not,
may or may not, probably, definitely].
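A small sketch shows how rules of this kind are commonly evaluated with the minimum rule for conjunction and the maximum rule for disjunction. The numeric grades attached to the five terms are an assumption made here (evenly spaced on [0, 1]); the paper does not give membership values, and the helper names are hypothetical.

```python
TERMS = ["definitely not", "probably not", "may or may not",
         "probably", "definitely"]

# Hypothetical grades: evenly spaced from 0.0 to 1.0 in term order.
GRADE = {term: i / (len(TERMS) - 1) for i, term in enumerate(TERMS)}


def fuzzy_and(*grades):
    """Conjunction by the minimum rule."""
    return min(grades)


def fuzzy_or(*grades):
    """Disjunction by the maximum rule."""
    return max(grades)


def fire(rule_strength, *antecedents):
    """Grade of the conclusion: the weaker of the conjoined
    antecedents and the strength attached to the rule."""
    return min(fuzzy_and(*antecedents), rule_strength)


runny_nose = GRADE["definitely"]
irritated_eyes = GRADE["definitely"]
cold = fire(GRADE["probably"], runny_nose, irritated_eyes)
allergy = fire(GRADE["may or may not"], runny_nose, irritated_eyes)
```

The minimum rule also shows why a single near-zero assignment can hide everything else: `fuzzy_and(0.9, 0.9, 0.01)` is 0.01 regardless of the two strong grades.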

There have been a number of
applications of fuzzy logic to expert
systems, including SPII (Martin and
Pradee, 1986), and REVEAL (Jones and
Morton, 1982). Some observations on
possible drawbacks:

(1) the maximum and minimum rules for
disjunction and conjunction may cancel
valuable information when fuzzy
individual assignments to various pieces
of evidence include one assignment that
is very close to zero;

(2) membership functions are
context-sensitive; for example, a "small"
building can be bigger than a "big"
house; generic membership functions, if
applied blindly, can lead to misleading
results;

(3) computational and storage
requirements can be large whenever
individual membership functions are
non-linear, non-trivial; discrete
approximations of non-linear membership
functions place a significant demand on
computer storage and computational
requirements.

5. COHEN'S SYSTEM OF INDUCTIVE
PROBABILITIES

Among the several researchers who
have noticed anomalies and paradoxes in
the application of conventional Bayesian
probability to inference in certain
situations is the Oxford logician L.J.
Cohen. In Cohen's system, "inductive
probabilities" are assigned to
alternative hypotheses (Cohen, 1980).

Cohen's system is congenial to a
process called "induction by
elimination", as one proceeds to use
evidence as a basis for elimination of
some hypotheses, such that the
hypothesis resisting being classified as
"false" by evidence is then considered
to be "correct", at least tentatively.

Figure 1 presents an illustration of
an application to an inferential task
involving four hypotheses about a
particular situation. Evidence is
gathered, resulting in n evidentiary
points. Each evidentiary point can be
thought of as a "test" to apply to each
hypothesis; some hypotheses pass (P)
this test while others fail (F). At
level four, that is after E1, E2, E3,
and E4 have been considered, the
assessed inductive probabilities (IP)
are as follows:

Hypotheses   Inductive Probabilities at level i=4

H1           IP(H1, evidence) = 2/n
H2           IP(H2, evidence) = 3/n
H3           IP(H3, evidence) = 1/n
H4           IP(H4, evidence) = 2/n

and so it appears that hypothesis H2 is
the one that the pieces of evidence E1,
E2, E3, and E4 support the most,
tentatively. Read Schum (1987) for an
in-depth study of Cohen's system as it
contrasts with Bayesian theory.

6. MYCIN CERTAINTY FACTORS

The MYCIN experiment of Shortliffe
and Buchanan (1975) was originally
applied to a subdomain of medicine where
little reliable data is available, and a
rigorous application of Bayes' rule
would be difficult if not impossible.

MYCIN's theoretical framework
includes terminology such as "measures
of belief", denoted MB, "measures of
disbelief", denoted MD, and "certainty
factors", CF. Formally, these are
defined as:

MB(H,E) = the measure of the belief
in the hypothesis H, given evidence E,

MD(H,E) = the measure of the
disbelief in the hypothesis H, given
evidence E, and

CF(H,E) = MB(H,E) - MD(H,E).

Since  MB(H,E)  is  a  number  between  0  and 
1,  and  MD(H,E)  is  also  a  number  between 
0  and  1,  the  certainty  factor  CF(H,E)  is 
a  number  between  -1  and  +1.  A  positive 
CF  indicates  that  there  is  more  reason 
to  believe  a  hypothesis  than  to 
disbelieve  it.  A  negative  CF  means  that 
a  hypothesis  is  more  strongly  rejected 
than confirmed. A CF of zero is a
"don't  know"  value  which  tells  us  that  a 
hypothesis  is  independent  of  some 
evidence.  Measures  of  belief  are  used 
in  an  inference  network  such  as  the  one 
shown  in  Figure  2  to  propagate  evidence, 
leading  to  a  hypothesis. 
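The definitions above translate into a few lines of code. The certainty factor and its interpretation follow the text. The function `combine_mb` is an addition: it is the usual MYCIN rule for pooling two confirming measures of belief, which this paper does not state, so treat it as an assumption about how evidence might be propagated through the network.

```python
def certainty_factor(mb, md):
    """CF(H,E) = MB(H,E) - MD(H,E), with MB and MD each in [0, 1]."""
    if not (0.0 <= mb <= 1.0 and 0.0 <= md <= 1.0):
        raise ValueError("MB and MD must lie in [0, 1]")
    return mb - md


def interpret(cf):
    """Read a certainty factor the way the text describes."""
    if cf > 0:
        return "more reason to believe than to disbelieve"
    if cf < 0:
        return "more strongly rejected than confirmed"
    return "don't know"


def combine_mb(mb1, mb2):
    """Pool two confirming measures of belief (assumed MYCIN rule):
    the second piece of evidence closes part of the remaining gap to 1."""
    return mb1 + mb2 * (1.0 - mb1)


cf = certainty_factor(0.75, 0.25)   # hypothetical MB and MD
pooled = combine_mb(0.5, 0.5)       # two hypothetical confirming findings
```

Under the pooling rule, two findings that each give MB = 0.5 yield 0.75, and the result never exceeds 1.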

7.  NON-MONOTONIC  REASONING 

Monotonic systems of thought are
such that, beginning with an initial set
of premises, the number of statements or
theorems that have been shown true
(i.e., proven) increases
monotonically (never decreases)
over time as new axioms or premises are
added. This is generally the case
for many traditional, axiomatic formal
systems of reasoning.

By contrast, in non-monotonic
systems of thought, the number of
practical structures of argument and
belief may increase as well as decrease
over time. This may be so because new
data may compel an analyst to different
conclusions. Humans become skilled at
merging conflicting data into existing
arguments or beliefs so as to regain
consistency while minimally disrupting
the book-keeping activities within such
a system.

A key concept in implementing
non-monotonic systems is that of
dependency-directed backtracking. As data and
constraints are added to a non-monotonic
system, they are treated as valid until
a contradiction is found; when and if a
contradiction is found, the system
rearranges the set of beliefs that are
"IN" (i.e., considered to be valid,
true), and the set of beliefs that are
"OUT" (i.e., considered to be not valid,
not true). Traditional systems, in the
face of contradiction, must backtrack
past the data that was added immediately
prior to the contradiction and then
search for a path that is free of
contradictions. As a result, many dead
ends are encountered with exhaustive
searches before a consistent total set
of beliefs is found (if available at all).
In a non-monotonic system, only those
beliefs that actually contribute to a
contradiction need to be examined.

During the knowledge-representation
part of the problem use is made of data
structures called support lists. A
support list (SL) justification for a
statement has the form:

Statement#   Statement   (SL (inlist) (outlist))

Such a justification is a valid
reason for belief in the statement if
every statement in its inlist is
believed to be true, and every statement
in its outlist is not believed to be
true. Two types of justifications are
used most frequently:

(1) A premise justification has an empty
inlist and an empty outlist, i.e.,
(SL()()). Nothing else needs to be
demonstrated to ensure acceptance of a
statement with such a justification.
Observed data and (unquestioned) general
principles might be treated this way.
For example,

N-1   Person X has a runny nose   (SL()())

is automatically regarded as IN.

(2) A monotonic justification has a
non-empty inlist, but an empty outlist, as
in

N-2   Person X has a nasal congestion
      (SL (Person X has a runny nose)
          (nasal membranes are normal))
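The validity condition for a support-list justification can be sketched as follows. The encoding of a justification as an (inlist, outlist) pair of tuples, the function names, and the use of a plain set for the currently believed statements are assumptions made for illustration. A full truth-maintenance system would also relabel dependent statements when a belief changes.

```python
def is_valid(justification, believed):
    """An SL justification holds when every inlist statement is
    believed and no outlist statement is believed."""
    inlist, outlist = justification
    return (all(s in believed for s in inlist)
            and not any(s in believed for s in outlist))


def status(statement, justifications, believed):
    """Return "IN" if at least one justification is valid, else "OUT"."""
    ok = any(is_valid(j, believed)
             for j in justifications.get(statement, []))
    return "IN" if ok else "OUT"


# Hypothetical encoding of the two examples in the text.
justifications = {
    "N-1": [((), ())],  # premise: empty inlist and outlist
    "N-2": [(("N-1",), ("nasal membranes are normal",))],
}

before = status("N-2", justifications, {"N-1"})
after = status("N-2", justifications,
               {"N-1", "nasal membranes are normal"})
```

Here `before` is "IN" and `after` is "OUT": adding a belief that appears in an outlist retracts a conclusion, which is exactly what a monotonic system cannot do.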

8. A COMPARISON OF THEORIES

Figure  3  depicts  the  format  and 
content  of  the  conclusions  reached  by 
some  of  these  methods.  Conclusions  are 
not  and  cannot  be  identical  given  the 
different  calculi  employed  by  these 
methods.  Table  1  presents  some  general 
observations  on  the  computational  and 
structural  requirements  of  each  method. 


                                        Alternative Hypotheses:

                                        H1       H2        H3           H4
Evidence:                               cold     allergy   no cold,     cold and
                                        only     only      no allergy   allergy

E1: Runny nose                          P        P         P            P
E2: Irritated eyes                      P        F         F            P
E3: Test results of nasal
    tissue culture                      F        P         F            F
E4: Itching of nose and throat          P        F         F            P
E5: Medication B stops itching          F        P         F            P
E6: Medication A alleviates
    runny nose                          P        F         F            F
E7: Swelling of nasal membranes         F        P         F            P
E8: Medication A causes drowsiness      P        P         P            P

Legend:  P: Pass   F: Fail

Figure 1. Hypothesis testing in diagnostic problem.


The observations in Table 1 can serve as guidelines for
matching a given problem to the most
appropriate method.

[Figure 2 diagram: an inference network linking evidence E1,
evidence E2, and swollen nasal membranes (evidence E7) to the
hypothesis "John has a common cold".]

Figure 2. Inference network for diagnostic problem.

Abstract

This  paper  studies  computational  issues  in 
applying  the  theory  of  belief  functions  to  the 
method of paired comparisons. General
algorithms  are  derived,  and  special  cases 
depending  on  the  focus  of  the  analysis  are 
studied. 

KEYWORDS:  Belief  functions;  Paired 
comparisons;  Preference  modelling. 

1.  Introduction 

The paired comparison experiment is
familiar in the social and decision sciences.

A subject makes pairwise comparisons between
members of a set $A = \{a_1, a_2, \ldots, a_n\}$ of n
objects, choosing the most preferred object of
each pair. There are several methods used to
infer a preference relation, often a ranking,
from the dominance data consisting of the
stated choices (see e.g. David (1963, 1971),
Thompson and Remage (1964), and Flueck and
Korsh (1975)).

Tritchler  and  Lockwood  (1988)  considered  an 
extension  of  paired  comparisons  in  which  a 
certainty  factor  between  0  and  1  is  expressed 
for  each  choice.  They  applied  the  theory  of 
belief  functions  to  study  the  weight  given  by 
this  data  to  various  preference  relations,  and 
derived  various  diagnostics  describing  the 
violation of transitivity and symmetry axioms
by  the  subject. 

The  belief  function  analysis  of  paired 
comparisons  is  very  time-consuming 
computationally.  This  paper  studies  the 
computational  problem  and  derives  algorithms. 
In  section  2  we  give  a  graph  theoretic 
formulation  of  the  belief  function 
methodology.  Section  3  poses  the  computing 
problem  and  gives  an  algorithm  for  computing 
basic probability numbers for the general
case.  In  section  4  we  give  an  algorithm  for 
computing  the  beliefs  for  the  'best'  object. 
Section  5  discusses  Monte  Carlo  methods. 
Section  6  discusses  computations  for  singleton 
focal  elements. 


2.  Preliminaries 

The essential framework of the theory of belief
functions was established by Dempster (1966,
1967a,b, 1968a,b, 1969). Shafer (1976)
elaborated and extended the theory. The reader
can consult Tritchler and Lockwood (1988) for
a brief summary of the theory which is
notationally consistent with the product space
notation of Kong (1986) used in this paper,
and for necessary concepts from graph theory.
In this section we use a graphical
representation to describe the application of
the theory to paired comparisons. We begin
with a frame of discernment
$\Theta_{ij} = \{a_i \to a_j,\ a_j \to a_i\}$
for each pair of elements, where $a_i \to a_j$
indicates that $a_i$ is preferred to $a_j$. A
simple support function $\mathrm{SUP}_{ij}$ on
$\Theta_{ij}$ is obtained from a comparison of
$a_i$ with $a_j$. If $a_i$ was chosen with
certainty $r_1$ we obtain a basic probability
function $m$ whose focal elements are the
subsets $\{a_i \to a_j\}$ and $\Theta_{ij}$;
$m(\{a_i \to a_j\}) = r_1$, and $m(\Theta_{ij}) = 1 - r_1$.

If another comparison is made choosing $a_j$,
and the resulting basic probability function
$m(\{a_j \to a_i\}) = r_2$, $m(\Theta_{ij}) = 1 - r_2$
is combined with the first, we obtain the
orthogonal sum

$m(\{a_i \to a_j\}) = r_1(1 - r_2)/(1 - r_1 r_2)$,

$m(\{a_j \to a_i\}) = r_2(1 - r_1)/(1 - r_1 r_2)$,

$m(\Theta_{ij}) = (1 - r_1)(1 - r_2)/(1 - r_1 r_2)$.
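The three masses above are easy to check numerically. The following sketch computes them for two opposing choices on the same pair; the function name, the dictionary keys, and the test values are illustrative assumptions.

```python
def opposing_sum(r1, r2):
    """Orthogonal sum of two simple support functions on one pair.

    The first supports a_i -> a_j with certainty r1, the second
    supports a_j -> a_i with certainty r2. The product r1 * r2 is the
    mass that falls on the empty set and is normalized away.
    """
    denom = 1.0 - r1 * r2
    if denom <= 0.0:
        raise ValueError("r1 = r2 = 1: the two choices conflict totally")
    return {
        "i->j": r1 * (1.0 - r2) / denom,
        "j->i": r2 * (1.0 - r1) / denom,
        "theta": (1.0 - r1) * (1.0 - r2) / denom,
        "K": 1.0 / denom,  # normalizing constant, called the conflict
    }


masses = opposing_sum(0.5, 0.5)
```

With r1 = r2 = 0.5 each of the three focal elements receives mass 1/3, and the constant K is 4/3.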

The  conflict  between  the  two  belief  functions 
is  (l-rtrz)'^.  In  general,  after  combining 
any  number  of  belief  functions  over  the  freune 
Bin,  the  possible  focal  elements  are  {an-^ai}, 
{ai-*an},  and  Sin  and  we  denote  the  orthogonal 
sum  Belkin »  with  basic  probability  function 
nPin.  We  denote  the  conflict  among  the  simple 
support  functions  over  Sin  by  K^in*  Each  of 
these  focal  elements  has  a  canonical  graph 
defined  as  follows;  the  canonical  graph  of  a 
singleton  focal  element  is  that  arc,  and  the 
canonical  graph  of  Sin  is  the  graph  with 
vertices  {ai,  an)  and  no  arc.  Thus,  the  subset 
^in  of  Bin  corresponding  to  a  canonical  graph 
G(^i.:j)  is  the  set  of  all  asymmetric  graphs 
with  node  set  {ai,  a^}  which  contain  0(^13). 

We  define  8(S)  to  be  the  product  space 
B(S)  =  nBij  where  the  product  is  over  the 
indices  (i,j)eS  for  S  =  {(i,j);  l<i<j^};  B(S) 
consists  of  all  asymmetric  relations  (or 
graphs)  with  node  set  A.  To  combine  the 
evidence  from  all  of  the  pairs,  each  Bel^t:] 
must  be  minimally  extended  from  6±]  to  B(S) 
giving  Bel^i^tBO)  and  tl>en  the  orthogonal  sum 


Bel“  = 

Bel°i2tB(S)  e  Bel°,.3t8(S)©. .  .®Bel°„-i  .>78(5) 

taken.  The  minimal  extension  of  Bel°i-3  to  8(S) 
assigns  basic  probability  number  m“t3  to  each 
set  of  the  form  ♦is  x6(S  -  {(i,j)}),  where 
^1.3  is  a  subset  of  813.  We  define  the 
canonical  graph  G  of  a  focal  element  of  the 
minimal  extension  Bel“i37Q(S)  to  have  the  same 
set  of  arcs  as  the  canonical  graph  of  the 
corresponding  focal  element  of  the  marginal 
belief  function  Lei”.  .;  it  wj'l'  re"-c:r-.t  the 
set  of  all  asymmetric  graphs  with  node  set  A 
which  contain  G.  We  will  write  this  set  as 
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and  Lockwood,  1988,  Lenna  6). 


<e(S)|G).  The  focal  eleisents  of  Bel^  are 
forised  from  intersections  of  focal  elements  of 
the  Bel^i^tecs) .  Each  focal  element  of  Bel“  is 
of  the  form 

♦  =  1^12  X  <>13  X  ...  X 

where  ♦ij  is  a  focal  element  of  Bel^u  and 
each  such  set  product  defines  a  focal  element 
of  Bel°  with  basic  probability  assignment 

m°(+)  =  m“ia(^i2)iiPi3(+i3). . .  m°„-i  „) 

(Tritchler  and  Lockwood,  1988).  In  graphical 
terms,  the  focal  elements  of  Bel'’  have 
canonical  graph 

G(<^)  -  GiaC^ra)  U  Ga.3(4*i3 )  U  .  .  .  UGn-i  tx)  i 

where  Gi.:](<^i.:])  is  the  canonical  graph  of  ^13. 
Thus,  intersections  of  focal  elements  of  the 
Bel°i3  are  represented  by  unions  of  disjoint 
sets  of  arcs. 

To  introduce  the  assumptions  of  the  linear 
ordering  preference  model  we  define  a  belief 
function  Bel*-  on  e(S)  with  basic  probability 
function  m*-(L)  =  1  where  L  is  the  subset  of 
6(S)  consisting  of  all  complete  transitive 
Irreflexive  asynmetric  graphs  (linear 
orderings).  Then  Bel  -  Bel'’  e  Bel'-  describes 
the  subject's  preference  over  A,  constrained 
by  our  assumptions  about  the  structure  of  his 
preferences . 

Tritchler  and  Lockwood  (1988)  show  that  a 
focal  element  9  of  Bel  has  canonical  graph 
G(9)  =  G'=(4>),  where  ^  is  a  focal  element  of 
Bel*’  and  G''(+)  is  the  transitive  closure  of 
G(+).  Further,  m(9)  =  K*-  m”(+)  where  the 

sunnation  is  over  every  focal  element  ^  of 
Bel"  such  that  G'’(^)  -  G(9),  and  K'-  is  the 
conflict  between  Bel°  and  Bel*-.  K*-  is  an  index 
of  the  subject's  circularity  since 
K*-  =  [1-  I  ■»“(♦)]-’•,  where  the  summation  is 
over  focal  elements  ^  of  Bel"  such  that  G(<^) 
contains  a  cycle. 

We can represent a partial order in two
ways: as a transitive, asymmetric graph H or,
alternatively, as a set $(L|H)$, the set of all
linear extensions of H. The focal elements of
Bel are of the form $(L|H)$, H a partial order.
Also, $(L|G^{t}) = (\Theta(S)|G) \cap L$ is nonempty iff G
is acyclic (Tritchler and Lockwood, 1988).
This expression describes the intersection of
a focal element of $\mathrm{Bel}^{\circ}$ with the focal element
of $\mathrm{Bel}^{L}$.

For ease of exposition, we have formed $\mathrm{Bel}^{\circ}$
from the simple support functions
corresponding to the comparisons in 2 steps:
first we combined replications of the same
comparison over $\Theta_{ij}$ and then the resulting
$\mathrm{Bel}^{\circ}_{ij}$ were combined over $\Theta(S)$. However,
Dempster's rule is commutative and the
combination can actually take place in any
order. The total conflict when computing Bel
is $K = K^{L} \cdot K^{\circ}$, where $K^{\circ} = \prod K^{\circ}_{ij}$ (Tritchler
and Lockwood, 1988).

3. Computing Basic Probability Numbers

We can dismiss the possibility of computing
beliefs for all subsets of L, for both
computational and interpretive reasons. The
complexity of $2^{n!}$ (implied by the number of
subsets) is clearly not feasible, and most of
those subsets will have no interpretation as a
relation. Tritchler and Lockwood (1988) show
that each focal element can be interpreted as
a partial order, and suggest that, for reasons
of interpretability, we calculate beliefs and
plausibilities only for subsets of L which
correspond to a partial order. This indicates
that we should calculate the basic probability
numbers for the focal elements, and from them
calculate beliefs and plausibilities for
selected partial orders. To this end, they
characterize the focal elements which are
contained in or intersect with a given partial
order.

We can describe the calculation of the
basic probability numbers in the following
way. The analysis proceeds by first forming
$\mathrm{Bel}^{\circ}$ and then combining $\mathrm{Bel}^{\circ}$ with $\mathrm{Bel}^{L}$. This
can be done in two steps (Tritchler and
Lockwood, 1988). At step 1, form the set $\Gamma$ of
all unions of the form $\bigcup_{(i,j) \in S} \Gamma_{ij}$, where $\Gamma_{ij}$ is
the canonical graph of a focal element of
$\mathrm{Bel}^{\circ}_{ij}$. $\Gamma$ consists of all focal elements of
$\mathrm{Bel}^{\circ}$, represented by their canonical graph. At
step 2, first the conflict $K^{L}$ is calculated as
the reciprocal of one minus the sum of the
basic probability numbers for all elements of
$\Gamma$ containing a cycle; those elements are
deleted from $\Gamma$; and the basic probability
numbers of the remaining elements of $\Gamma$ are
normalized by the factor $K^{L}$. Next the
transitive closure of each element of $\Gamma$ is
computed and duplicates are eliminated,
summing the basic probability numbers of
identical closures. We denote the resulting
set of graphs by $\Gamma^{t}$. This step corresponds to
the orthogonal sum of $\mathrm{Bel}^{\circ}$ and $\mathrm{Bel}^{L}$. The
complexity of the algorithm is $\prod_{(i,j) \in S} N_{ij}$, where
$N_{ij}$ is the number of focal elements of $\mathrm{Bel}^{\circ}_{ij}$.
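
The two steps can be sketched in Python as follows. This is an illustrative sketch only, not the authors' implementation: the function names, the dictionary layout of `pair_beliefs`, and the representation of a focal element by a frozenset of arcs (empty for the vacuous focal element) are all assumptions made for the example.

```python
from itertools import product


def transitive_closure(arcs):
    """Return the transitive closure of a set of directed arcs (u, v)."""
    closure = set(arcs)
    changed = True
    while changed:
        changed = False
        for (a, b) in list(closure):
            for (c, d) in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return frozenset(closure)


def has_cycle(closure):
    """A transitively closed graph has a cycle iff some vertex reaches itself."""
    return any(u == v for (u, v) in closure)


def combine_with_linear_order(pair_beliefs):
    """pair_beliefs maps each pair (i, j) to a list of (arcs, mass) tuples.

    Returns the conflict and the normalised masses keyed by closed graph.
    Assumes the cyclic mass is strictly less than one.
    """
    # Step 1: one canonical graph per pair, unioned, with product mass.
    gamma = {}
    for choice in product(*pair_beliefs.values()):
        graph = frozenset().union(*(arcs for arcs, _ in choice))
        mass = 1.0
        for _, m in choice:
            mass *= m
        gamma[graph] = gamma.get(graph, 0.0) + mass

    # Step 2: drop cyclic graphs, merge identical closures, renormalise.
    cyclic_mass = 0.0
    closed = {}
    for graph, mass in gamma.items():
        closure = transitive_closure(graph)
        if has_cycle(closure):
            cyclic_mass += mass
        else:
            closed[closure] = closed.get(closure, 0.0) + mass
    conflict = 1.0 / (1.0 - cyclic_mass)
    return conflict, {g: conflict * m for g, m in closed.items()}
```

The loop in step 1 visits one combination per element of the product of the $N_{ij}$, which is the complexity quoted above.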

The computation will be more efficient if
we can induce the deletions of cycles and
duplicates from $\Gamma$ earlier. To this end,
recall that each focal element $\phi$ of $\mathrm{Bel}^{\circ}$ can
be written as $\phi = \phi_{12} \times \phi_{13} \times \cdots \times \phi_{n-1,n}$, where
$\phi_{ij}$ is a focal element of $\mathrm{Bel}^{\circ}_{ij}$. Suppose
that $\phi_{1} = \phi_{12} \times \phi_{13} \times \cdots \times \phi_{gh}$, for some g, h,
g<h<n, is such that $G(\phi_{1})$ contains a cycle.
Then each focal element of $\mathrm{Bel}^{\circ}$ of the form

$$\phi = \phi_{1} \cap \bigcap_{(i,j) \in B} \phi_{ij}$$

will be associated with a cycle
for all choices of $\phi_{ij}$, where $\phi_{ij}$ is a focal
element of $\mathrm{Bel}^{\circ}_{ij}$, and $B = S - \{(1,2), (1,3), \ldots, (g,h)\}$.
The collection of such focal elements
will thus contribute total probability

$$m^{\circ}_{12}(\phi_{12})\, m^{\circ}_{13}(\phi_{13}) \cdots m^{\circ}_{gh}(\phi_{gh})$$

to the probability mass for the null set when
$\mathrm{Bel}^{\circ}$ is combined with $\mathrm{Bel}^{L}$. Thus if we check
for cycles while calculating $\mathrm{Bel}^{\circ}$, we may
prune focal elements with cycles as soon as
they appear in the orthogonal sum.


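Similarly, assume that
$\phi_{1} = \phi_{12} \times \phi_{13} \times \cdots \times \phi_{gh}$ and
$\phi_{2} = \phi'_{12} \times \phi'_{13} \times \cdots \times \phi'_{gh}$ are such that
$G^{t}(\phi_{1}) = G^{t}(\phi_{2})$. Then

$$
\begin{aligned}
G^{t}\Big(\phi_{1} \cap \bigcap_{(i,j) \in B} \phi_{ij}\Big)
 &= \Big(G(\phi_{1}) \cup G\Big(\bigcap_{(i,j) \in B} \phi_{ij}\Big)\Big)^{t} \\
 &= \Big(G^{t}(\phi_{1}) \cup G^{t}\Big(\bigcap_{(i,j) \in B} \phi_{ij}\Big)\Big)^{t} \\
 &= \Big(G^{t}(\phi_{2}) \cup G^{t}\Big(\bigcap_{(i,j) \in B} \phi_{ij}\Big)\Big)^{t} \\
 &= \Big(G(\phi_{2}) \cup G\Big(\bigcap_{(i,j) \in B} \phi_{ij}\Big)\Big)^{t} \\
 &= G^{t}\Big(\phi_{2} \cap \bigcap_{(i,j) \in B} \phi_{ij}\Big)
\end{aligned}
$$

for any choices of the focal elements $\phi_{ij}$
of $\mathrm{Bel}^{\circ}_{ij}$ for $(i,j) \in B$. Thus we may combine $\phi_{1}$
and $\phi_{2}$ into a single focal element with basic
probability number $m^{\circ}(\phi_{1}) + m^{\circ}(\phi_{2})$. In fact,
the above arguments show that we may always
represent, e.g., $\phi_{1}$ by $G^{t}(\phi_{1})$. This will
identify duplicates and cycles (a cycle will
result in $G^{t}$ not asymmetric) at that step in
the orthogonal sum.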


4.  Computing  Beliefs  About  the  Best  Object 

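We can reduce the computational complexity
of the analysis if we are interested in
choosing the single best object, the best 2
objects, or in general the set of $l$ most
favoured objects. Let $Z_1, Z_2, \ldots, Z_k$ be the
strong components partitioning of the
dominance data. Define

$$I_m = \{(i,j) : a_i, a_j \in Z_m,\ i \ne j\}, \quad m = 1, \ldots, k,$$

and let

$$I_0 = S - I_1 - I_2 - \cdots - I_k.$$

Define $\mathrm{Bel}^{\circ}_m$ to be the orthogonal
sum of the simple support functions over
$\Theta(I_m)$, $m = 0, 1, \ldots, k$, and let

$$\mathrm{Bel}_m = \mathrm{Bel}^{\circ}_m{\uparrow}\Theta(S) \oplus \mathrm{Bel}^{L}.$$

Then

$$
\begin{aligned}
\mathrm{Bel} &= \mathrm{Bel}^{\circ} \oplus \mathrm{Bel}^{L} \\
 &= \mathrm{Bel}^{\circ}_0{\uparrow}\Theta(S) \oplus \mathrm{Bel}^{\circ}_1{\uparrow}\Theta(S) \oplus \cdots \oplus \mathrm{Bel}^{\circ}_k{\uparrow}\Theta(S) \oplus \mathrm{Bel}^{L} \\
 &= [\mathrm{Bel}^{\circ}_0{\uparrow}\Theta(S) \oplus \mathrm{Bel}^{L}] \oplus [\mathrm{Bel}^{\circ}_1{\uparrow}\Theta(S) \oplus \mathrm{Bel}^{L}] \oplus \cdots \oplus [\mathrm{Bel}^{\circ}_k{\uparrow}\Theta(S) \oplus \mathrm{Bel}^{L}] \\
 &= \mathrm{Bel}_0 \oplus \mathrm{Bel}_1 \oplus \cdots \oplus \mathrm{Bel}_k
\end{aligned}
$$

by the idempotence of $\mathrm{Bel}^{L}$.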

To restrict our attention to hypotheses
about the most highly ranked objects, we
define a partition of L:

$$T = \{T_1, T_2, \ldots, T_n\},$$

where
$T_i = \{\theta : \theta \in L$ and $a_i$ is ranked highest in $\theta\}$.
$T_i$ is a partial order with canonical graph
$H_i = \{a_i \to a_j,\ j \ne i\}$.

The belief function of interest is $\mathrm{Bel}{\downarrow}T$,
the coarsening of Bel to the partition T. The
computational complexity is reduced if we
calculate $\mathrm{Bel}_0{\downarrow}T \oplus \cdots \oplus \mathrm{Bel}_k{\downarrow}T$, but we must
verify that this calculation agrees with
$\mathrm{Bel}{\downarrow}T$. If it does, then we say that T
discerns the interaction of the $\mathrm{Bel}_i$,
$i = 0, 1, \ldots, k$, relevant to itself, using Shafer's
terminology. Shafer and Logan (1987) give a
criterion for assuring this discernment: for
any choice of $\theta_i$, $i = 0, 1, \ldots, k$, where $\theta_i$ is a
focal element of $\mathrm{Bel}_i$, and any $T_j \in T$,

$$\theta_0 \cap \theta_1 \cap \cdots \cap \theta_k \cap T_j = \emptyset \ \text{ implies } \ \theta_i \cap T_j = \emptyset \ \text{ for some } i.$$

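Theorem 1. T discerns the interaction between
$\mathrm{Bel}_0, \mathrm{Bel}_1, \ldots, \mathrm{Bel}_k$ relevant to itself.

Proof: We may write a focal element of $\mathrm{Bel}_i$ as

$$\theta_i = [\phi_i \times \Theta(S - I_i)] \cap L$$

for $\phi_i$ a focal element of $\mathrm{Bel}^{\circ}_i$. Then by the
independence of the frames $\Theta(I_i)$, $i = 0, 1, \ldots, k$,

$$\theta_0 \cap \theta_1 \cap \cdots \cap \theta_k = (\phi_0 \times \phi_1 \times \cdots \times \phi_k) \cap L,$$

where $\phi = \phi_0 \times \phi_1 \times \cdots \times \phi_k$ is a focal element
of $\mathrm{Bel}^{\circ}$ with canonical graph $G(\phi) = G(\phi_0) \cup
G(\phi_1) \cup \cdots \cup G(\phi_k)$, for $G(\phi_i)$ the canonical
graph of $\phi_i$.

First consider the case for which $\theta_0 \cap \theta_1 \cap
\cdots \cap \theta_k = \phi \cap L = \emptyset$. Tritchler and Lockwood
(1988, Theorem 1) show that $\phi \cap L = \emptyset$ iff $G(\phi)$
contains a cycle, which implies a cycle in
some $G(\phi_i)$ by the definition of strong
components, so $\theta_i = \emptyset$ and $\theta_i \cap T_j = \emptyset$.

Next consider the case $\theta = \theta_0 \cap \theta_1 \cap \cdots \cap \theta_k \ne \emptyset$,
$\theta \cap T_j = \emptyset$. $T_j = (L|H_j) = (\Theta(S)|H_j) \cap L$. Let
$\phi = (\Theta(S)|G)$ be any focal element of $\mathrm{Bel}^{\circ}$ such
that $\phi \cap L = \theta$. Note that $G = G(\phi_0) \cup G(\phi_1) \cup
\cdots \cup G(\phi_k)$ for some choice of $\phi_0, \phi_1, \ldots, \phi_k$,
where $\phi_i$ is a focal element of $\mathrm{Bel}^{\circ}_i$. Then
$\theta \cap T_j = \emptyset$ implies

$$
\begin{aligned}
\phi \cap L \cap T_j &= \phi \cap (\Theta(S)|H_j) \cap L \\
 &= (\Theta(S)|G) \cap (\Theta(S)|H_j) \cap L \\
 &= (\Theta(S)|G \cup H_j) \cap L = \emptyset,
\end{aligned}
$$

so $G \cup H_j$ contains a cycle. But then adding $H_j$
to G creates a cycle, since G is acyclic.
Thus G must contain some arc $a_k \to a_j$ incoming to
$a_j$, where $a_k$ is either in the strong component
$Z_m$ containing $a_j$ or is in some other strong
component. In the first case $G(\phi_m) \cap T_j = \emptyset$,
and in the second case $G(\phi_0) \cap T_j = \emptyset$.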

For a given strong component $Z_m$, the
calculation of $\mathrm{Bel}_m{\downarrow}T$ can be done over the
frame $\Theta(I_m)$. To show this, explicitly express
the operation of coarsening $\mathrm{Bel}_m$ to the
partition T: $m_m{\downarrow}T(V) = \sum m_m(B)$, where the
summation is over all focal elements B of $\mathrm{Bel}_m$
such that $V = \{T_j : T_j \cap B \ne \emptyset\}$. We can write $B \cap T_j \ne \emptyset$
as

$$
\begin{aligned}
(L|G(B)) \cap (L|H_j) &= (\Theta(S)|G(B)) \cap (\Theta(S)|H_j) \cap L \\
 &= (\Theta(S)|G(B) \cup H_j) \cap L \ne \emptyset,
\end{aligned}
$$

which occurs iff adding $H_j$ to $G(B)$ does not create a
cycle. This condition is completely
determined by arcs in $\Theta(I_m)$, and our
calculations can be done over $\Theta(I_m)$.

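There is a further result leading to the
efficient calculation of $\mathrm{Bel}_0$. It states that
each $\mathrm{Bel}^{\circ}_{ij}$, $(i,j) \in I_0$, can be coarsened to T
before combining.

Corollary 1. $\mathrm{Bel}_0{\downarrow}T = \mathrm{Bel}^{\circ}_0{\downarrow}T$. Further, T
discerns the interaction among $\mathrm{Bel}^{\circ}_{ij}{\uparrow}\Theta(S)$,
$(i,j) \in I_0$, relevant to itself.

Proof:

$$m_0{\downarrow}T(V) = \sum_{B \in W(V)} m_0(B),$$

where $W(V) = \{B : B$ a focal element of $\mathrm{Bel}_0$ such
that $V = \{T_j : T_j \cap B \ne \emptyset\}\}$. Thus

$$m_0{\downarrow}T(V) = \sum_{B \in W(V)} K^{L}_0 \sum m^{\circ}_0(\phi) = \sum_{\phi \in Z(V)} m^{\circ}_0(\phi),$$

where $Z(V) = \{\phi : \phi$ a focal element of $\mathrm{Bel}^{\circ}_0$ such
that $V = \{T_j : T_j \cap \phi \cap L \ne \emptyset\}\}$, since $K^{L}_0 = 1$ is the
conflict over $I_0$. Noting that $T_j \cap L = T_j$, we
see that $m_0{\downarrow}T(V) = m^{\circ}_0{\downarrow}T(V)$. The discernment
of interaction follows from an argument
similar to the second case of the proof of
Theorem 1.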

An algorithm based on the above results is:

ALGORITHM 1:

1° Calculate the commonality function $Q_0$
   for $\mathrm{Bel}_0{\downarrow}T$.
2° For each $I_m$, $m = 1, 2, \ldots, k$:
   3° Calculate the basic probability
      function $m_m$ for $\mathrm{Bel}_m$ by the method of
      section 3.
   4° Coarsen $m_m$ to T (this can be done
      concurrently with step 3°).
   5° Calculate $Q_m$ for $\mathrm{Bel}_m{\downarrow}T$ from $m_m{\downarrow}T$.
6° For each subset B of T, calculate
   $Q(B) = \prod_{i=0}^{k} Q_i(B)$.
7° Calculate $\mathrm{Bel}{\downarrow}T$ and the associated
   plausibility function $\mathrm{Pl}{\downarrow}T$ from Q.

The above algorithm calculates beliefs for
all ($2^n$) subsets of T. If only certain subsets
are of interest it might be more efficient to
calculate the orthogonal sums using basic
probability functions instead of commonality
functions.
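
Steps 6° and 7° rest on the standard fact that commonality functions multiply under Dempster's rule. A minimal Python sketch of that mechanism follows, assuming a small frame whose elements are sortable labels and mass functions stored as dictionaries keyed by frozensets; the function names are illustrative and not taken from the paper.

```python
from itertools import combinations


def subsets(frame):
    """Yield every subset of the frame as a frozenset."""
    items = sorted(frame)
    for r in range(len(items) + 1):
        for combo in combinations(items, r):
            yield frozenset(combo)


def commonality(mass, frame):
    """Q(B) is the sum of m(A) over focal elements A that contain B."""
    return {b: sum(m for a, m in mass.items() if b <= a)
            for b in subsets(frame)}


def mass_from_commonality(q, frame):
    """Moebius inversion: m(A) = sum over B containing A of (-1)^|B - A| Q(B)."""
    out = {}
    for a in subsets(frame):
        if not a:
            continue  # mass on the empty set is the conflict, dropped here
        total = sum((-1) ** len(b - a) * q[b]
                    for b in subsets(frame) if a <= b)
        if abs(total) > 1e-12:
            out[a] = total
    return out


def combine(masses, frame):
    """Orthogonal sum via the pointwise product of commonality functions."""
    q = {b: 1.0 for b in subsets(frame)}
    for mass in masses:
        qi = commonality(mass, frame)
        for b in q:
            q[b] *= qi[b]
    unnormalised = mass_from_commonality(q, frame)
    total = sum(unnormalised.values())  # equals one minus the conflicting mass
    return {a: m / total for a, m in unnormalised.items()}
```

The sketch enumerates every subset of the frame, so it is only practical when the frame, here the partition T, is small.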

It is instructive to consider 1°, the
calculation of $Q_0$, separately. Each $\mathrm{Bel}^{\circ}_{ij}$,
$(i,j) \in I_0$, must be a simple support function,
otherwise both $a_i \to a_j$ and $a_j \to a_i$ are in the
dominance data and $a_i$ and $a_j$ would be in the
same strong component. For each $a_i \in A$, let us
collect all the $\mathrm{Bel}^{\circ}_{ij}$ which prefer some $a_j$ to
$a_i$, coarsen them to T, and take the orthogonal
sum (justified by Corollary 1). The result is
a simple support function with focus $T - \{T_i\}$.
We thus obtain simple support functions over T
of the form $\mathrm{SUP}_i(T - \{T_i\}) = s_i$, $i = 1, 2, \ldots, n$.
By Corollary 1, their orthogonal sum will be
$(\mathrm{Bel}^{\circ} \oplus \mathrm{Bel}^{L}){\downarrow}T$.

The conflict between the $\mathrm{SUP}_i$ is one. To
see this, note that a null intersection of
focal elements of the $\mathrm{SUP}_i$ is possible only if
$T - \{T_i\}$ is a focal element of $\mathrm{SUP}_i$ for
$i = 1, 2, \ldots, n$. But this implies that each $a_i$
has an incoming arc in $I_0$, implying a cycle
and thus contradicting the definition of
strong components. Thus combining the $\mathrm{SUP}_i$
yields the commonality function with simple
form

$$Q_0(B) = \prod_{T_i \in B} (1 - s_i).$$

Further, by Barnett (1981),

$$\mathrm{Pl}_0(B) = 1 - \prod_{T_i \in B} s_i.$$

Thus if the set of comparisons has no
circularities, so $I_0 = S$ and $\mathrm{Pl}_0 = \mathrm{Pl}$, the
computations for each subset of T are of
complexity linear in n.
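
As a small numerical illustration of the two product formulas, the Python sketch below evaluates $Q_0(B)$ and $\mathrm{Pl}_0(B)$ directly. The function names and the convention of passing the masses $s_i$ as a dictionary indexed by object are assumptions made for the example.

```python
from math import prod


def q0(b, s):
    """Commonality of B: product of (1 - s_i) over the T_i belonging to B."""
    return prod(1.0 - s[i] for i in b)


def pl0(b, s):
    """Plausibility of B: one minus the product of s_i over the T_i in B."""
    return 1.0 - prod(s[i] for i in b)


# Hypothetical masses for three objects; s[i] is the mass SUP_i places on
# "object i is not the best".
s = {1: 0.5, 2: 0.0, 3: 0.2}
print(q0({1, 3}, s), pl0({1, 3}, s))
```

Each call touches every member of B once, which is the linear cost noted above.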

5.  Monte  Carlo  Method 

Let $H = \{H_1, H_2, \ldots, H_p\}$ be a set of
partial orders which are hypotheses of
interest. For example, we could have H = T, T
defined as in section 4. We wish to
approximate $\mathrm{Bel}(H_i)$, $\mathrm{Pl}(H_i)$, and the conflict
K. Interpreting a focal element F as a random
subset with probability m(F), the Monte Carlo
procedure is apparent from the graphical
formulation in section 3 and is given below as
an algorithm.

ALGORITHM 2:

Initialize M = 0, Z = 0.

1° Repeat N times: initialize $R = \emptyset$, $G = \emptyset$.
   2° For each $(i,j) \in S$:
      3° With probability $m^{\circ}(\phi_{ij})$ randomly
         select a focal element $F_{ij}$ from the
         focal elements of $\mathrm{Bel}^{\circ}_{ij}$.
      4° Add $m^{\circ}(F_{ij})$ to the set R.
      5° If $F_{ij}$ is a singleton add the
         corresponding arc to G.
   6° Calculate $G^{t}$ and $m = \prod_{r \in R} r$.
   7° M = M + m.
   8° If $G^{t}$ contains a cycle add m to Z.
   9° If $G^{t}$ is cycle-free, then for $H_i \in H$,
      $i = 1, 2, \ldots, p$:
      10° If $H_i$ is a subgraph of $G^{t}$, allocate
          m to $\mathrm{Bel}(H_i)$.
      11° If adding $H_i$ to $G^{t}$ does not
          create a cycle, allocate m to
          $\mathrm{Pl}(H_i)$.
12° Set $K = (M - Z)^{-1}$. K is the conflict.
13° For $i = 1, 2, \ldots, p$ set $\mathrm{Bel}(H_i) = K\,\mathrm{Bel}(H_i)$,
    $\mathrm{Pl}(H_i) = K\,\mathrm{Pl}(H_i)$.

Step 10° is justified by Lemma 4 of Tritchler
and Lockwood (1988). Step 11° is testing for a
non-null intersection of the focal element
with the hypothesis $H_i$.
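
A simplified Python sketch of this kind of sampler is given below. It is not a transcription of Algorithm 2: as an assumption, each focal element is drawn with probability equal to its mass and every draw then counts with unit weight, so the estimates are plain relative frequencies among the cycle-free draws. The data layout and names are hypothetical.

```python
import random


def transitive_closure(arcs):
    """Return the transitive closure of a set of directed arcs (u, v)."""
    closure = set(arcs)
    changed = True
    while changed:
        changed = False
        for (a, b) in list(closure):
            for (c, d) in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure


def is_acyclic(arcs):
    """True if no vertex reaches itself in the transitive closure."""
    return all(u != v for (u, v) in transitive_closure(arcs))


def monte_carlo(pair_beliefs, hypotheses, n_draws, seed=0):
    """pair_beliefs maps a pair to a list of (arcs, mass); hypotheses are arc sets.

    Assumes at least one draw is cycle-free.
    """
    rng = random.Random(seed)
    cyclic = 0
    bel = [0] * len(hypotheses)
    pl = [0] * len(hypotheses)
    for _ in range(n_draws):
        graph = set()
        for focal in pair_beliefs.values():
            choices = [arcs for arcs, _ in focal]
            weights = [m for _, m in focal]
            graph |= rng.choices(choices, weights=weights)[0]
        closure = transitive_closure(graph)
        if any(u == v for (u, v) in closure):
            cyclic += 1          # this draw falls on the null set
            continue
        for idx, h in enumerate(hypotheses):
            if h <= closure:     # hypothesis is a subgraph of the closure
                bel[idx] += 1
            if is_acyclic(closure | h):  # non-null intersection
                pl[idx] += 1
    kept = n_draws - cyclic
    conflict = n_draws / kept
    return conflict, [b / kept for b in bel], [p / kept for p in pl]
```

Fixing the seed makes a run reproducible; the precision of the estimates improves with the number of draws in the usual Monte Carlo fashion.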

6. Computing Beliefs and Plausibilities for Rankings

Suppose our interest is focused on rankings
over A, i.e. singleton subsets of L. We assume
that we have approximated the conflict K by
the method of section 5, or used Theorem 2 of
Tritchler and Lockwood (1988) to simplify the
exact calculation. We also assume that we
have preprocessed the data so that for each
frame $\Theta_{ij}$ we have at most two belief
functions. One is a simple support function
focused on $\{a_i \to a_j\}$, while the other is focused
on $\{a_j \to a_i\}$. Let $\mathrm{SUP}_1, \mathrm{SUP}_2, \ldots, \mathrm{SUP}_N$ be the
simple support functions so defined, where
$m_i(A) = s_i$ for A the focus of $\mathrm{SUP}_i$ over some
frame $\Theta_{kl}$. Then the commonality function for
Bel is

$$Q = K\, Q^{L} \prod_{i=1}^{N} Q_i, \qquad (1)$$

where $Q_i$ is the commonality function for
$\mathrm{SUP}_i{\uparrow}\Theta(S)$ and $Q^{L}$ is the commonality function
for $\mathrm{Bel}^{L}$. For a singleton set $\theta \in L$, $Q^{L}(\theta) = 1$,
and

$$
Q_i(\{\theta\}) =
\begin{cases}
1 - s_i & \text{if the focus of } \mathrm{SUP}_i \text{ is the reversal of an arc in } G(\theta), \\
1 & \text{otherwise.}
\end{cases}
$$

Thus, since $\{\theta\}$ is a singleton, $\mathrm{Pl}(\{\theta\}) = Q(\{\theta\})$
is easily computed when $\theta \in L$.
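
A minimal Python sketch of this plausibility calculation follows. The representation of a ranking as a list (best object first) and of each simple support function as a pair `((u, v), s)`, meaning focus $u \to v$ with mass `s`, is an assumption made for illustration, as is the default conflict `k=1.0`.

```python
def ranking_arcs(ranking):
    """All arcs u -> v with u ranked above v: the canonical graph of the ranking."""
    n = len(ranking)
    return {(ranking[p], ranking[q]) for p in range(n) for q in range(p + 1, n)}


def plausibility(ranking, supports, k=1.0):
    """Evaluate (1) at a single ranking: K times the product of the Q_i."""
    arcs = ranking_arcs(ranking)
    pl = k
    for (u, v), s in supports:
        if (v, u) in arcs:      # the focus reverses an arc of the ranking
            pl *= 1.0 - s
    return pl


# Hypothetical data: a preferred to b, c preferred to b, c preferred to a.
supports = [(("a", "b"), 0.8), (("c", "b"), 0.5), (("c", "a"), 0.25)]
print(plausibility(["a", "b", "c"], supports))
```

The cost is one pass over the N simple support functions per ranking.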

We can compute Bel for singletons more
efficiently if we can identify those rankings
with zero belief without actually calculating
their belief. The following definition and
theorem enable us to do this. A Hamiltonian
path is a path of length n where n is the
number of objects in A, and no object is on
the path twice.

Theorem 2. $\phi$ is a focal element of $\mathrm{Bel}^{\circ}$ and $\theta$
a singleton focal element of Bel such that
$G^{t}(\phi) = G(\theta)$ iff

1) $G(\phi)$ is acyclic and

2) there is a Hamiltonian path in $G(\phi)$.

Proof: First, suppose that acyclic $G(\phi)$
contains a Hamiltonian path. Thus for any
pair $(a_i, a_j)$, either $a_i$ is reachable in $G(\phi)$
from $a_j$ or vice versa. By Theorem 2 of
Tritchler and Lockwood (1988), $\theta$ such that
$G^{t}(\phi) = G(\theta)$ is a focal element of Bel, and
by Theorem 5.4 of Harary et al (1965), $G^{t}(\phi)$
is the canonical graph of a linear ordering,
i.e. a singleton focal element.

Next assume that $\theta$ is a singleton focal
element of Bel and $G^{t}(\phi) = G(\theta)$. Clearly $G(\phi)$
is acyclic. To establish 2), we define the
score $s_i$ of $a_i$ to be the number of arcs in
$G(\theta)$ of the form $a_i \to a_j$. Since $\theta$ is a linear
ordering,

$$\{s_1, s_2, \ldots, s_n\} = \{0, 1, \ldots, n-1\}$$

and $s_i > s_j$ whenever $a_i \to a_j$ is in $G(\theta)$ (Moon, 1968, Theorem 9).
Choose $a_i$ and $a_j$ so that $s_i = k$ and $s_j = k-1$,
and assume $a_i \to a_j$ is not in $G(\phi)$. Since $a_i \to a_j$
is in $G^{t}(\phi)$ there must be a path $a_i \to a_l \to \cdots \to a_j$
in $G(\phi)$ by Theorem 5.4 of Harary et al (1965).
But then $s_i - s_j \ge 2$ since $G^{t}(\phi)$ is transitive,
a contradiction, so $a_i \to a_j$ must be in $G(\phi)$.
Thus $G(\phi)$ contains a Hamiltonian path which
corresponds to the ranking of the $a_i$ in $\theta$.

Since $\{a_i \to a_j\}$ can be a focal element of
$\mathrm{Bel}^{\circ}_{ij}$, and thus an arc in some $G(\phi)$, iff $a_i$
was preferred to $a_j$ on at least one occasion,
we can enumerate Hamiltonian paths in the
dominance data to find singleton focal
elements of Bel.

When a Hamiltonian path W is found, Bel for
the corresponding singleton focal element is
calculated by enumerating all acyclic $\phi$ such
that $G(\phi)$ contains W. Specifically,

$$\mathrm{Bel}(\theta) = K \prod_{i \in P} s_i \prod_{i \in R} (1 - s_i) = \mathrm{Pl}(\theta) \prod_{i \in P} s_i,$$

where $P = \{i$ : the focus $F_i$ of $\mathrm{SUP}_i$ is an arc
in W$\}$ and $R = \{i$ : the focus $F_i$ of $\mathrm{SUP}_i$ is of
the form $a_j \to a_i$, where $a_i$ precedes $a_j$ on the
path W$\}$. To see this, divide the simple
support functions into 3 classes corresponding
to P, R, and the complement C of $P \cup R$. By
Theorem 2, for an intersection $\phi$ of focal
elements of the $\mathrm{SUP}_i$ to satisfy $G^{t}(\phi) = G(\theta)$,
the focal elements $F_i$, $i \in P$, must be in the
intersection. Also, no focal element from R
can then be in the intersection since that
would create a cycle in $G(\phi)$. Any combination
of focal elements from C in the intersection
gives $G^{t}(\phi) = G(\theta)$, since the transitive
closure of the arcs in W determines a complete
graph, so we have

$$\mathrm{Bel}(\theta) = K \prod_{i \in P} s_i \prod_{i \in R} (1 - s_i) \sum m(F_1)\, m(F_2) \cdots m(F_k),$$

where $F_i$ is a focal element of $\mathrm{SUP}_i$, $i \in C$.
Since the arcs corresponding to focal elements
from C are in a subgraph of the cycle-free
complete graph $G(\theta)$, no choice of $F_1, F_2, \ldots, F_k$
can yield a null intersection, so the above
summation reduces to 1.

The problem of enumerating Hamiltonian
paths is NP-hard. We can prune the search for
Hamiltonian paths by using (1) to restrict our
search to rankings of high plausibility. As
each arc is added to a candidate path the
partial product corresponding to (1) is
checked. If it falls below a threshold $\alpha$, the
attempt to complete that path is abandoned.
This restricts the search to rankings of
plausibility greater than $\alpha$.
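
The pruned search can be sketched in Python as a depth-first enumeration. This is a sketch under stated assumptions rather than the authors' code: simple support functions are given as pairs `((u, v), s)` with focus $u \to v$ and mass `s`, the conflict `k` is supplied by the caller, and the names `plausible_rankings` and `belief` are hypothetical.

```python
def plausible_rankings(objects, supports, alpha, k=1.0):
    """Return (ranking, plausibility) for Hamiltonian paths with Pl > alpha."""
    observed = {arc for arc, _ in supports}
    results = []

    def extend(path, partial):
        if len(path) == len(objects):
            results.append((tuple(path), partial))
            return
        for nxt in objects:
            if nxt in path:
                continue
            if path and (path[-1], nxt) not in observed:
                continue  # consecutive objects must be joined by an observed arc
            # Ranking nxt below everything already on the path reverses every
            # support whose focus is nxt -> (an object already on the path).
            factor = 1.0
            for (u, v), s in supports:
                if u == nxt and v in path:
                    factor *= 1.0 - s
            if partial * factor > alpha:   # otherwise abandon this branch
                extend(path + [nxt], partial * factor)

    extend([], k)
    return results


def belief(ranking, pl, supports):
    """Bel = Pl times the product of s_i over supports focused on arcs of the path."""
    path_arcs = set(zip(ranking, ranking[1:]))
    bel = pl
    for arc, s in supports:
        if arc in path_arcs:
            bel *= s
    return bel
```

Because every factor is at most one, the partial product can only decrease as the path grows, so an abandoned branch cannot contain a ranking whose plausibility exceeds the threshold.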
Fusion and Propagation in Graphical Belief Models
Russell  Almond,  Harvard  University 


ABSTRACT 

This paper demonstrates the potential of graphical belief function
models in decision problems. The working of a simple example
problem illustrates the basic procedures involved in calculating
marginal and conditional beliefs in a complex system. First, graphical
modeling techniques break the example down into a series of
small relationships, linked in a model hypergraph. Next the
relationships between the attributes (variables) of the problem are
expressed as belief functions. A simple procedure (Kong [1986b])
transforms the model hypergraph into a tree of cliques. This is a
tree of "chunks" of the original problem; the information in each
clique can be combined independently of all other cliques except
its neighbors. Each node in the tree of cliques passes messages
(expressed as belief functions) to its neighbors consisting of the
local information fused with all the information that has propagated
through the other branches of the tree. This propagation
algorithm, along with the fusion algorithm given by the direct sum
operator, can easily compute marginal beliefs, and can save considerable
computational cost over the brute force approach. Finally,
the paper explores new methodology for presenting the results of
the computation.

Key Concepts: Graphical Models, Belief Functions, Bayesian
Models, Fusion and Propagation, Probability in Expert Systems,
Triangulated Graphs.

1. Attacking Large Problems

Many large problems, of a type that occur frequently in expert
systems, involve a large number of variables and complex information
about the relationships among those variables. These are
not the classical statistical problems of estimating parameters from
repeated observations, but instead require combinations of evidence
from diverse sources to reach conclusions about the plausibility
of certain events. Thus a decision maker, using one of
these models, requires marginal information about certain events
or groups of events. Graphical models are a clear and concise
way of describing problems of many variables with dependencies
among those variables. The variables, or attributes, become nodes
in a model hypergraph and are joined by hyperedges. Only relationships
among variables which all share a common hyperedge
must be modeled, considerably simplifying both the modeling and
the computational task. Graphical models have been studied by
Pearl [1986a, 1986b], Moussouris [1974], and Lauritzen and Spiegelhalter
[1988] in the Bayesian case, and Kong [1986a], Shafer, Shenoy,
and Mellouli [1986] and Shenoy and Shafer [1986] in the belief function
case.

Probability distributions are not quite flexible enough to model
the more complex interactions that can take place among attributes
(variables) in one of these models. Belief functions, usually represented
by the set function BEL, are a generalization of probability
that allow ways to express total ignorance, Bayesian probability
distributions, conditional probability distributions (likelihoods),
logical relationships (production rules) and observations. All these
diverse types of knowledge can be combined with a uniform fusion
rule, the direct sum operator. Belief functions can be simply restricted
to a smaller frame and easily extended to a larger frame
without adding additional information. Shafer [1976, 1982] develops
the theory of belief functions and Kong [1986a] summarizes it.
Belief functions provide a more flexible modeling tool than probabilities,
but their computational cost rises quickly with the size
of the problem. The problem must be subdivided into manageable
chunks before it is solved. Thus graphical models and belief
functions work well together.

For example, consider a problem with N attributes or variables,
and M independent relationships among groups of those
attributes. Let $\Theta$ be the discrete joint outcome space for those N
variables, and the relationships among those variables be modeled
as a belief function over $\Theta$. To compute the combined belief function,
$\mathrm{BEL}_{\Theta}$, which in turn yields marginal belief functions for any
attribute or group of attributes of interest, the computational cost
is $M \cdot 2^{|\Theta|}$. For the special case where all of the attributes are
binary variables, the cost is $M \cdot 2^{2^{N}}$.

The high potential computation cost, due to the large size of
$\Theta$, makes the direct computation of $\mathrm{BEL}_{\Theta}$ impractical. However,
the following strategy, which I have implemented, makes the
computational tasks manageable:

1. Break up the problem using the Graphical Modeling and conditional
independence assumptions, as described in Section 2.

2. Locally model relationships with Belief Functions. This process
will be briefly described in Section 3.

3. Re-express the graphical model as a Tree of Cliques (see
Dempster and Kong [1988] and Kong [1986b]). The tree of
cliques will be described in Section 4.

4. Propagate and Fuse local information to find margins of the
total belief function. This will be described in sections 5 and 6.

5. Now compute and examine any desired marginal belief function.
This will be described in section 7.

These procedures reduce the computational costs dramatically.
Returning to the above example, let m be the number of
nodes in the Tree of Cliques (m > M but only slightly), and k be
the maximum number of neighbors of a clique in the tree. Furthermore,
let $C^{*}$ be the largest clique in the tree, n be the number of
variables in $C^{*}$, and $\Theta_{C^{*}}$ be the outcome space associated with $C^{*}$.
Then the computational costs are no more than $m\,k\,2^{|\Theta_{C^{*}}|}$ or, for
the case of binary variables, $m\,k\,2^{2^{n}}$. In most cases $n \ll N$; this
reduces the size of the double exponential and yields a large savings
in computational time. Furthermore, these computational costs are
worst case figures, based on arbitrarily complex belief functions. In
practice, with simple belief functions, the computational costs will
be much smaller.

This paper will illustrate the strategy by following its application
to a simple example. Consider the reasoning by which the
Captain of a ship decides how many days late her ship will arrive
in port. The first step in reasoning about the Captain's Decision
is to define the attributes (variables) of the problem. The goal is
to find the Arrival delay, or by how many days the ship will be
delayed (for simplicity assume it will be an integer). This delay
is the sum of two attributes: the Departure delay and the Sailing
delay. Before the ship leaves port it could be delayed for Loading
problems; a Forecast of foul weather could cause the Captain to
delay departure; and Maintenance could cause the ship to sit at
the dock. For simplicity assume that each of these three factors
delays departure by one day. Therefore the total departure delay
could be up to three days. Similarly, bad Weather en route could
cause delays, as could needing to make Repairs at sea. These delays
contribute to the Sailing delay, again an integer number of
days.

2.  Graphical  Models 

Graphical models provide a way of organizing information about
the relationships among variables in problem domains with many
variables. Decision problems, diagnosis problems, fault trees,
log-linear models, and expert systems all fall into this category. Thus
the graphical model is a form of knowledge representation, and a
graphical model design very much resembles relational data base
design.

Breaking a large problem (the complete model) into a series of
smaller problems is the essence of graphical modeling. A model
hypergraph, $\mathcal{G}$, organizes the pieces of the large problem. Each node
of the model hypergraph is an attribute or variable of the problem.
Each edge of the model hypergraph corresponds to a group of
attributes that are related through some mechanism, modeled with
the methods of Section 3. Any pair of attributes (nodes) which
are not directly connected are assumed to be conditionally
independent under the Markov conditions (see below). This last part
is important. Any group of attributes that share a common
hyperedge will have a belief function mechanism which models their
relationship, yet any two nodes which are not directly connected
will have no such mechanism. Instead, their relation will be implicit
in the way they are connected through other nodes. Thus
the models of small problems with the graphical structure linking
them form the model of the large problem. Furthermore, independence
conditions may be easier to elicit from an expert than joint
distributions (Pearl [1982]). Thus building a graphical model limits
the size of the joint distributions (or complex mechanisms) which
experts must provide.

For example, Figure 1 shows the model hypergraph for the
Captain's Decision, with the first letter of each attribute name
representing the node.


Figure 1. Model Hypergraph for the Captain's Decision

The second step in building a graphical model is to construct
the edges. The edge containing L, D, F and M represents the previously
noted relationship between those attributes, namely that
Loading, weather Forecast and Maintenance delays could each add
one day to our total Departure delay. Similarly, the edge containing
S, W and R expresses a similar rule for delays that occur at
sea. The edge containing A, D and S represents the logical assumption
that the total delay equals the sum of the two partial delays.
The edge containing F and W represents the condition that the
Weather and the Forecast are dependent variables, and similarly
the edge between M and R graphically depicts the dependence between
Maintenance and Repairs at sea. Lastly the three singleton
edges containing L, F and M respectively represent the Captain's
prior beliefs about Loading, weather Forecast, and Maintenance.

There is a one to one correspondence between the edges of the
hypergraph and the component belief functions of the graphical
model. Summing all of those belief functions, although computationally
intensive, yields a complete picture of the interaction
among all attributes of the problem. However, as a part of partitioning
the total belief function into its components, we are also
making certain assumptions about the conditional independence
of the attributes. These are the Markov conditions, developed by
Moussouris [1974] for the probabilistic case and extended here for
belief functions and hypergraphs. (The formal definition of a belief
function BEL(A) will follow in section 3. For the purposes of the
discussion in this section, one can think of belief functions as some
generalization of probabilities and still follow the arguments.)

Markov Conditions. Let $\mathcal{G}$ be a hypergraph. For any pair of
nodes X and Y of $\mathcal{G}$ such that X and Y do not share a common
hyperedge, let S be a set of nodes through which all paths from
X to Y pass. For any such pair, BEL(X | S, Y) = BEL(X | S)
and BEL(Y | S, X) = BEL(Y | S). That is, X is independent of
Y given S. Then $\mathcal{G}$ will be considered a Markov Hypergraph and is
said to satisfy the Markov Conditions.

The belief functions of the Captain's Decision hypergraph
(Figure 1) follow these Markov conditions. Consider the two attributes
R and D. {S, F, M} is one set of attributes which separates
the two target attributes. Thus given S, F, and M fixed, R
and D are independent. This makes sense in terms of the original
model, since, if we know the Sailing delay, the Forecast for the
weather, and the Maintenance record at the dock, it is plausible
that the Departure delay and the Repairs at sea are independent.

Although  the  model  hypergraph  visually  suggests  many  inde¬ 
pendence  conditions,  not  all  of  the  independence  conditions  fall 
neatly  into  the  graphical  model.  For  example,  if  the  Departure  de¬ 
lay  is  not  fixed,  it  is  plausible  that  the  Loading  and  Forecast  are 
independent.  However,  because  their  relationship  represents  one 
process,  modeled  with  one  belief  function,  they  all  share  a  com¬ 
mon  hyperedge.  It  is  even  possible  to  go  further,  and  to  model  the 
relationship  among  several  variables  with  a  vacuous  belief  function 
(thus  implying  that  they  are  totally  independent).  Although  this 
makes  the  graphical  model  a  less  useful  picture  of  the  problem,  it 
could  be  a  useful  technique  for  assessing  the  importance  of  depen¬ 
dency  assumptions.  Conversely,  once  two  attributes  are  marked  as 
conditionally  independent  in  the  graphical  model,  there  is  no  way 
to  add  a  dependence  between  them,  without  adding  a  new  edge  to 
the model.

3. Belief Functions

Here  I  will  briefly  discuss  why  belief  functions  are  attractive  tools 
for representing uncertainty in networks. Definitions and detailed
descriptions can be found in Shafer[1976] or Kong[1986a]. I will
only  give  an  abstract  of  the  ideas  here. 

Think of a set of possible outcomes $\Theta = \{\theta_1, \ldots, \theta_n\}$ of an
experiment. Now given a subset A of the possible outcomes, define
BEL(A) as the belief (a number between 0 and 1) that the true
outcome will be in set A. With probabilities, one normally thinks
of placing a mass function on the possible outcomes. With belief
functions the mass function, m(B), places mass on elements of
the power set, $2^\Theta$, of the outcome space, that is, on subsets of
the possible outcomes. We normally restrict ourselves to belief
functions over discrete outcome spaces. The total mass is always
one. For a normalized belief function, the mass on the empty set is
always zero. Elements of the power set which have non-zero mass
are called focal elements. Equation 1 relates the mass function to
the belief function.

$$\mathrm{BEL}(A) = \sum_{B \subseteq A} m(B) \qquad (1)$$

The plausibility of A, PL(A), is $1 - \mathrm{BEL}(\bar{A})$, where $\bar{A}$ is the
complement of A with respect to $\Theta$. Furthermore, belief functions
are superadditive; that is, if $A, B \subseteq \Theta$ and $A \cap B = \emptyset$, then
$\mathrm{BEL}(A) + \mathrm{BEL}(B) \le \mathrm{BEL}(A \cup B)$. Note that this last rule is
a generalization of the corresponding case for probabilities.
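
The relationships among the mass function, BEL and PL can be made concrete with a small sketch. The code below is only an illustration, not the implementation described in this paper: the three-outcome frame, the mass values and the names `subsets`, `bel` and `pl` are all hypothetical. It computes BEL from Equation 1 and PL as the mass of the focal elements that intersect A, which is equivalent to one minus the belief in the complement.

```python
from itertools import combinations

def subsets(frame):
    """Every subset of the frame, as frozensets (the power set)."""
    items = sorted(frame)
    return [frozenset(c) for r in range(len(items) + 1)
            for c in combinations(items, r)]

def bel(mass, a):
    """BEL(A): total mass of the focal elements contained in A."""
    a = frozenset(a)
    return sum(w for b, w in mass.items() if b <= a)

def pl(mass, a):
    """PL(A): total mass of the focal elements that intersect A."""
    a = frozenset(a)
    return sum(w for b, w in mass.items() if b & a)

# Hypothetical normalized mass function on a three-outcome frame.
FRAME = frozenset({"x", "y", "z"})
MASS = {
    frozenset({"x"}): 0.5,
    frozenset({"x", "y"}): 0.3,
    FRAME: 0.2,  # mass left on the whole frame (ignorance)
}

if __name__ == "__main__":
    for a in subsets(FRAME):
        print(sorted(a), round(bel(MASS, a), 3), round(pl(MASS, a), 3))
```

Looping over `subsets(FRAME)` also gives a direct numerical check of the superadditivity property for this particular mass function.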

The  mass  function  of  a  belief  function  over  a  binary  outcome 
space  is  particularly  easy  to  interpret.  For  example,  suppose  the 
outcome space is $\Theta = \{F, T\}$ where F represents fair weather, and
T represents foul weather. Consider the belief function with the
following mass function:

$$\begin{aligned} m(\{F\}) &= 0.6 \\ m(\{T\}) &= 0.2 \\ m(\Theta) &= 0.2 \end{aligned} \qquad (2)$$

We  can  interpret  this  as  either:  (1)  There’s  a  20%  chance  of  bad 
weather,  a  60%  chance  of  fair  weather  and  20%  chance  of  unpre¬ 
dictable  weather,  or  (2)  There’s  a  20-40%  chance  of  bad  weather. 
In  terms  of  betting  odds,  by  this  belief  function  I  would  be  com¬ 
fortable  betting  with  odds  better  than  1:4  that  there  will  be  foul 
weather,  or  betting  against  foul  weather  with  odds  of  better  than 
3:2. I am indifferent to (that is to say, I would not take either side
of) any bets within that region. It is useful to think of the mass
placed on a given focal element (that is, a subset of the outcome
space which has non-zero mass) as the weight of evidence that suggests
the outcome will be in the focal element, and that cannot be
divided (because of our ignorance) into finer divisions. As another
example,  we  might  think  of  the  belief  function  given  in  Equation  2 
as  an  urn  containing  black  balls  and  white  balls.  With  probability 
0.2  we  draw  a  black  ball,  with  probability  0.6  we  draw  a  white  ball. 
With  probability  0.2  we  draw  a  ball  which  looks  grey  in  this  light, 
and  we  cannot  determine  its  color  without  further  experiments. 
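
The interval and the betting odds quoted for the weather example can be checked with exact arithmetic. The following is a minimal sketch under the assumption that the mass function is exactly the one in Equation 2; the helper names `bel`, `pl` and `odds` are illustrative and are not taken from the paper.

```python
from fractions import Fraction

FAIR, FOUL = "F", "T"
FRAME = frozenset({FAIR, FOUL})

# Mass function of Equation 2, written with exact fractions.
MASS = {
    frozenset({FAIR}): Fraction(6, 10),
    frozenset({FOUL}): Fraction(2, 10),
    FRAME: Fraction(2, 10),
}

def bel(mass, a):
    """Lower probability: mass of the focal elements contained in a."""
    a = frozenset(a)
    return sum(w for b, w in mass.items() if b <= a)

def pl(mass, a):
    """Upper probability: mass of the focal elements that intersect a."""
    a = frozenset(a)
    return sum(w for b, w in mass.items() if b & a)

def odds(p):
    """Express the probability p as an odds pair (for, against)."""
    ratio = p / (1 - p)
    return ratio.numerator, ratio.denominator

if __name__ == "__main__":
    print("foul weather between", bel(MASS, {FOUL}), "and", pl(MASS, {FOUL}))
    print("odds at the lower bound for foul:", odds(bel(MASS, {FOUL})))
    print("odds at the lower bound for fair:", odds(bel(MASS, {FAIR})))
```

The lower bound on foul weather, 1/5, corresponds to odds of 1:4, and the lower bound on fair weather, 3/5, corresponds to odds of 3:2, which are the two thresholds quoted in the text.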

Belief  functions  have  certain  advantages  over  probabilities  for 
modeling  relationships  in  graphical  models: 

1. Upper and lower probabilities. A belief function provides a two-value
assessment of uncertainty, the belief (BEL) or lower probability
and the plausibility (PL) or upper probability, where a probabilistic
model would provide a single number. Belief functions
express uncertainty about the chance of an event occurring in a
simple way.

2.  Can  be  used  to  represent:  Bayesian  probabilities,  logical  rules, 
observations,  and  ignorance.  A  Bayesian  belief  function  has  all  its 
mass  on  the  singleton  subsets  (representing  the  elements)  of  the 
outcome space. A “vacuous” belief function, one with all its mass
on the frame, $\Theta$, provides an unambiguous definition of ignorance,
unlike a so-called non-informative prior. A belief function with all
of its mass on a single focal element can act as either an observation
(if that focal element is a singleton) or a logical rule (if the focal
element is more complex). Furthermore, dividing the mass between
the frame and one other focal element produces a belief function
which expresses partial support for a rule or an observation. (These
operate in the way that the MYCIN (Buchanan and Shortliffe[1984])
authors wanted their certainty factors to work.) Belief functions
incorporate  both  set  and  probability  theory,  and  mix  logical  and 
probabilistic  knowledge  in  a  single  uniform  framework. 

3. Belief functions easily marginalize to smaller frames or vacuously
extend to larger frames. Merely projecting the focal elements
onto a smaller or larger frame trivially marginalizes or extends a
belief function. Note that the former is true of probabilities as well
as belief functions, but the latter is not.

4. Fusion rule. The direct sum operator, $\oplus$ (Dempster[1968],
Shafer[1976]), involves both set intersection and multiplication of
probabilities with renormalization. It is a generalization of both
logical and Bayesian inference rules.
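
The fusion rule in item 4 can be illustrated with a short sketch of the direct sum over a single shared frame. This is an illustration only, written under the assumption that both mass functions are given as dictionaries from focal elements to masses; the function name `direct_sum` and the example inputs other than Equation 2 are hypothetical. Products of masses are assigned to the intersection of the focal elements, the mass falling on the empty set is reported as the conflict, and the remainder is renormalized.

```python
def direct_sum(m1, m2):
    """Combine two mass functions on the same frame.

    Returns (combined mass function, conflict), where the conflict is the
    mass that would fall on the empty set before renormalization.
    """
    raw = {}
    conflict = 0.0
    for b1, w1 in m1.items():
        for b2, w2 in m2.items():
            inter = b1 & b2
            if inter:
                raw[inter] = raw.get(inter, 0.0) + w1 * w2
            else:
                conflict += w1 * w2
    if 1.0 - conflict <= 1e-12:
        raise ValueError("the belief functions are totally contradictory")
    return {b: w / (1.0 - conflict) for b, w in raw.items()}, conflict

FRAME = frozenset({"F", "T"})
WEATHER = {frozenset({"F"}): 0.6, frozenset({"T"}): 0.2, FRAME: 0.2}  # Equation 2
VACUOUS = {FRAME: 1.0}                                   # total ignorance
COIN = {frozenset({"F"}): 0.5, frozenset({"T"}): 0.5}    # hypothetical Bayesian input
SAW_FAIR = {frozenset({"F"}): 1.0}                       # hypothetical observation

if __name__ == "__main__":
    print(direct_sum(WEATHER, VACUOUS))   # combining with ignorance changes nothing
    print(direct_sum(WEATHER, COIN))      # result has mass only on singletons
    print(direct_sum(WEATHER, SAW_FAIR))  # observation puts all mass on {F}
```

Combining with the vacuous belief function returns the input unchanged, and combining with a Bayesian belief function yields a Bayesian result, which is the sense in which the rule generalizes Bayesian updating.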

Belief  functions  generally  yield  a  great  deal  of  flexibility  at  the 
cost  of  more  complexity  in  both  notation  and  computation.  All  the 
examples in (2) above are easy to specify. Furthermore, for a binary
variable,  the  belief  function  is  easy  to  interpret  through  its  upper 
and  lower  probability  functions.  The  binary  variable  case  seems  to 
be, in general, a case where the Bayesian description yields too few
parameters (we must choose a single probability) or else too many
(we  must  specify  a  prior  distribution  over  all  possible  probabilities 
that  the  value  will  be  true). 

On  the  other  hand,  as  shown  in  section  7,  a  general  belief  func¬ 
tion  is  often  difficult  to  describe  or  interpret,  simply  because  of  the 
large  numbers  of  sets  of  outcomes  to  examine.  A  belief  function 
over a frame with n possible outcomes has $2^n$ different outcome
sets  which  could  be  assigned  positive  mass.  Fortunately,  for  model¬ 
ing  purposes,  graphical  modelers  can  usually  restrict  themselves  to 
the  simple  belief  functions  they  do  understand.  If  they  can  not,  the 
problem  may  require  further  subdivision  into  simpler  problems — 
another  graphical  modeling  process. 

Another aspect of the complexity of belief functions is that
they are difficult to elicit from experts. While there is much literature
on the general problem of eliciting probabilities from experts,
no  one  as  of  yet  has  examined  the  problem  for  the  more  complex 
cases  of  belief  functions.  For  the  present,  graphical  modelers  must 
rely  on  the  simpler  and  more  easily  understood  special  cases. 

If the input belief functions in the model are all Bayesian
probabilities, then the marginal belief functions resulting from this
computation will be probabilities too. Thus in a general way, everything
we discuss here applies to probabilities as well as belief
functions. Doing the mathematics in the belief function notation
helps us to understand what is happening without worrying about
difficult technical details of extending probability distributions.


4.  Making  the  Tree  of  Cliques 

From the model hypergraph, $\mathcal{G}$, choose a collection of sets,
$\mathcal{C} = \{C^1, \ldots, C^m\}$, of $\mathcal{G}$'s attributes, $\mathcal{A} = \{A_1, \ldots, A_n\}$. (Recall that
attributes are nodes of the model hypergraph. I will deliberately
use the term “attributes” here to avoid confusion with “cliques,”
the nodes of the tree of cliques.) If the model hypergraph is triangulated
(acyclic), then the sets $C^i$ will be the cliques (maximally
complete subgraphs) of the model hypergraph. If the model is not
triangulated, a procedure given in Kong[1986b] (also in Appendix II
of Almond[1988]) produces these sets. The Kong procedure implicitly
fills in hyperedges to create a triangulated graph; the $C^i$'s are
cliques of the triangulated graph. The Kong procedure also connects
the cliques to form a tree, called the Tree of Cliques. The
connections satisfy a separation property which will be given
below. For computational purposes, the tree of cliques is easier to
use than the model hypergraph (Dempster and Kong[1988]).

A  useful  way  to  think  about  the  tree  of  cliques  is  to  consider 
each  clique  to  be  a  group  of  attributes  within  which  some  complex 
interaction  takes  place.  This  complex  interaction  will  be  mod¬ 
eled  by  a  belief  function  representing  the  information  local  to  that 
clique,  and  by  messages,  also  in  the  form  of  belief  functions,  passed 
from  the  neighboring  cliques  in  the  tree.  Calculations  are  per¬ 
formed  by  propagating  messages  between  the  nodes  via  the  schema 
given  in  Section  5  and  by  fusing  the  messages  via  the  schema  given 
in Section 6. The result is a belief function representing the margin
of the graphical belief function for each of the sets $C^i$.

In  order  to  make  the  computations  more  modular,  we  augment 
the  tree  of  cliques  by  adding  each  of  the  original  edges  of  the  model 
hypergraph  to  C  (new  nodes  in  the  tree  of  cliques)  and  connecting 
each  new  node  to  any  clique  that  contains  it.  (Note  that  every 
hyperedge will always be contained in at least one clique). We can
also  augment  the  tree  of  cliques  by  adding  (as  a  node)  any  set  of 
attributes  that  is  a  subset  of  one  of  the  cliques.  In  particular,  the 
singleton  sets  of  one  attribute  are  always  a  subset  of  one  of  the 
nodes  and  are  frequently  margins  of  the  graphical  belief  function 
that might be important to examine later on. The augmented
collection of sets is called $\mathcal{C}^+$.

Each set, $C^i \in \mathcal{C}^+$, has a local belief function, $BEL_{C^i}$, representing
the local information attached. This local belief function
can be easily found from our graphical model, provided that
every edge of the original model hypergraph is in $\mathcal{C}^+$. For every
node, $C^i$, in the augmented tree of cliques that corresponds to
one of the original hyperedges, $BEL_{C^i}$ is the belief function corresponding
to that hyperedge. For every other node, $BEL_{C^i}$ is
vacuous.

As  noted  above,  the  tree  of  cliques  is  easy  to  build  if  the  model 
hypergraph  is  triangulated.  In  order  to  get  the  tree  of  cliques 
from  a  non-triangulated  graph,  the  Kong  procedure  fills  in  extra 
hyperedges  to  make  a  triangulated  graph.  There  is  often  more 
than  one  way  to  fill  in  the  hypergraph  and  different  fill-ins  lead 
to  different  trees  of  cliques,  some  of  which  are  better  than  others. 
Because,  as  discussed  in  the  first  section,  the  cost  of  combining 
belief  functions  is  exponential  with  the  size  of  the  largest  clique, 
trees  with  smaller  cliques  will  be  better.  The  problem  of  finding 
the  optimal  tree  of  cliques  is  NP-hard.  Kong  and  I  have  developed 
some  heuristics  for  finding  good  trees  of  cliques  that  seem  to  do 
well (given in Appendix II of Almond[1988]).

Let us return to the Captain's Decision problem. Figure 2
reproduces the model hypergraph from Figure 1. Notice that some
new edges (the dashed lines) have been implicitly filled in as part
of  the  construction  procedure.  Figure  3  shows  the  tree  of  cliques. 


Figure 3. Augmented Tree of Cliques

The nodes {S, D, M, F}, {L, D, M, F}, {R, S, M, W},
{A, S, D} and {S, M, W, F} are the cliques of the graph in Figure 2.
The nodes {A, S, D}, {L, D, M, F}, {R, M}, {S, W, R}, and
{W, F} all correspond to original edges of the model hypergraph.
These latter nodes are loaded with the belief functions corresponding
to those edges. Furthermore, {M}, {L}, and {F} all have
univariate prior belief functions associated with them, which are
also loaded into the appropriate cliques. The edges {S, D, M, F},
{S, M, W, F} and {R, S, M, W} are all associated with the filled in
edges. They have vacuous belief functions associated with them.
The remaining nodes, corresponding to the remaining singleton
edges, also have vacuous belief functions associated.

The nodes, $C^i$, of the tree of cliques are connected so as to
satisfy the separation property (Kong[1986b]). This is also true of
the  augmented  tree  of  cliques. 
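
The separation property requires that the attributes shared by any two nodes of the tree also appear in every node on the path between them. A brute-force check can be sketched as follows. This is only an illustration: the five cliques are those named for the Captain's Decision example, but the tree edges used here are an assumption made for the sketch (Figure 3 is not reproduced), and the function names are hypothetical.

```python
from itertools import combinations

def make_tree(edges):
    """Build an adjacency map from a list of undirected edges."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj

def tree_path(adj, start, goal):
    """Return the unique path between two nodes of a tree."""
    stack = [(start, [start])]
    seen = {start}
    while stack:
        node, path = stack.pop()
        if node == goal:
            return path
        for nxt in adj[node]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append((nxt, path + [nxt]))
    raise ValueError("the nodes are not connected")

def satisfies_separation(adj):
    """True if every pairwise intersection lies in all nodes on the path."""
    for a, b in combinations(adj, 2):
        shared = a & b
        if any(not shared <= c for c in tree_path(adj, a, b)):
            return False
    return True

SDMF, LDMF = frozenset("SDMF"), frozenset("LDMF")
RSMW, ASD, SMWF = frozenset("RSMW"), frozenset("ASD"), frozenset("SMWF")

# Assumed connections among the five cliques (not taken from Figure 3).
GOOD_TREE = make_tree([(SMWF, SDMF), (SMWF, RSMW), (SDMF, LDMF), (SDMF, ASD)])
# A deliberately bad arrangement: D is shared by LDMF and SDMF but not SMWF.
BAD_TREE = make_tree([(LDMF, SMWF), (SMWF, SDMF), (SMWF, RSMW), (SDMF, ASD)])

if __name__ == "__main__":
    print(satisfies_separation(GOOD_TREE), satisfies_separation(BAD_TREE))
```

The check is quadratic in the number of nodes and is meant for small examples; it is a way to validate a hand-built tree, not a substitute for the Kong procedure.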


Separation Property. Given two nodes in $\mathcal{C}^+$,

$$C^i = \{A_{i_1}, \ldots, A_{i_p}\}, \qquad C^j = \{A_{j_1}, \ldots, A_{j_q}\},$$

let $C^k$ be any clique lying on the path between them. Then:

$$C^i \cap C^j \subseteq C^k$$

The separation property also implies that the subgraph of the
tree of cliques consisting of all cliques that contain a given attribute,
or set of attributes, will be connected. This is not obvious, but
can be seen after examining Figure 3 for a few minutes.

5. Propagation

At the heart of the computational system, the cliques (nodes) of the
tree pass messages among themselves (Dempster and Kong[1988]).
This message passing system propagates the local information
(which makes up the graphical model) to global information which
can be used to answer questions about the process being modeled.
Their system operates as follows.

The cliques pass “messages” to their neighbors in the tree of
cliques, in the form of belief functions. Define $BEL_{C^i \Rightarrow C^j}$ to be
the belief function passed from $C^i$ to $C^j$. Its frame corresponds to
$C^i \cap C^j$. Each clique “fuses” its incoming messages with its local
information, and “propagates” the results as its outgoing message.
The fusion step will be described in detail in the next section.

When the node $C^i$ has received messages from all its neighbors
except $C^j$, it can calculate the message $BEL_{C^i \Rightarrow C^j}$ and pass it
to $C^j$. Therefore the outermost leaves of the tree can immediately
pass their messages inwards. The outermost cliques (the leaves of
the tree) pass their information toward the center (root). When all
the information reaches the center, the cliques in the center start
passing messages back towards the outside (leaves).

Let us illustrate this with an example. Figure 4 shows the
tree of cliques propagating their messages inwards, towards the
root {S, M, W, F}. When {S, M, W, F} receives all of its incoming
messages, outward propagation can occur. The flow then follows
backwards through the arrows of Figure 4.

Figure 4. Propagating Inwards

Incidentally, there is nothing in this discussion which is specific
to belief functions; belief functions only provide a convenient
uniform notation for both the local information and the messages.
This passing of messages in a tree of cliques works equally well in the
special case when all the belief functions are Bayesian probability
distributions.

6. Fusion

Now let us turn to the details of what happens inside each clique
to “fuse” incoming messages to produce the outgoing messages. At
this moment, a single node, $C^*$, which has neighbors $C^1, \ldots, C^k$,
has received messages from all but the last of its neighbors. This
is shown in Figure 5.

The node $C^*$ must now compute the message, $BEL_{C^* \Rightarrow C^k}$,
that is to be passed to the remaining clique $C^k$. Equation (1) shows
the calculation that $C^*$ does.


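$$BEL_{C^* \Rightarrow C^k} = \left[ BEL_{C^*} \oplus BEL_{C^1 \Rightarrow C^*} \oplus \cdots \oplus BEL_{C^{k-1} \Rightarrow C^*} \right]_{\downarrow C^* \cap C^k} \qquad (1)$$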


The message passed is just the sum of the incoming messages from
all of the other cliques, the $BEL_{C^j \Rightarrow C^*}$'s, with the local information,
$BEL_{C^*}$, stored at that clique.

Each of these calculations is done over the frame corresponding
to $C^*$, and then the result is projected onto the frame corresponding
to $C^* \cap C^k$. [Note: the computational cost is then < m k 2 as
given in Section 1.] A clique can pass a message as soon as it has
received  messages  from  all  but  one  of  its  neighbors.  In  particular, 
the  leaves,  which  only  have  one  neighbor,  can  immediately  pass  a 
message  to  that  neighbor.  Therefore  the  fusion  and  propagation 
algorithm  is: 

•  Starting with the leaves, the nodes propagate messages inwards
until the messages reach the root. At each successive
stage, the inner nodes receive all of their incoming messages
and can pass towards the center.

•  At  this  point,  the  root  has  received  messages  from  each  of  its 
neighbors,  thus  it  can  pass  outwards  in  all  directions,  calculat¬ 
ing  its  messages  by  equation  (1);  each  time  a  clique  calculates 
a  message,  it  omits  the  destination  from  the  sum.  When  a 
clique  receives  a  message  from  the  center,  it  now  has  all  of  its 
incoming  messages  and  can  send  messages  to  each  of  its  more 
outward  neighbors.  This  is  continued  until  the  leaves  receive 
their  messages. 
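
The two passes described above amount to an ordering of the messages along the edges of the tree. The sketch below computes one valid ordering and leaves out the belief function arithmetic entirely. It is an illustration under stated assumptions: nodes are plain string labels, the tree edges are assumed for the example rather than read from Figure 4, and `message_schedule` is a hypothetical name.

```python
def message_schedule(adj, root):
    """Return (inward, outward) lists of (sender, receiver) pairs.

    Inward messages run from the leaves to the root; outward messages
    run from the root back to the leaves.
    """
    parent = {root: None}
    order = [root]
    for node in order:              # breadth-first walk; `order` grows as we go
        for nbr in adj[node]:
            if nbr not in parent:
                parent[nbr] = node
                order.append(nbr)
    # Deepest nodes first, so a node sends inward only after its children have.
    inward = [(n, parent[n]) for n in reversed(order) if parent[n] is not None]
    # Shallowest first, so a node hears from its parent before sending outward.
    outward = [(parent[n], n) for n in order if parent[n] is not None]
    return inward, outward

# Assumed tree over the five cliques of the example, labeled by their attributes.
TREE = {
    "SMWF": ["SDMF", "RSMW"],
    "SDMF": ["SMWF", "LDMF", "ASD"],
    "RSMW": ["SMWF"],
    "LDMF": ["SDMF"],
    "ASD": ["SDMF"],
}

if __name__ == "__main__":
    inward, outward = message_schedule(TREE, "SMWF")
    for sender, receiver in inward + outward:
        print(sender, "=>", receiver)
```

Every edge carries exactly two messages, one in each direction, so a tree with n nodes needs 2(n - 1) fusion steps in total.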

At  this  point  each  of  the  cliques  has  a  series  of  incoming  mes¬ 
sages  describing  the  contribution  of  the  other  parts  of  the  tree  to 
the total belief function. If we wish to view the margin of the total
belief function, $BEL_{\downarrow C^*}$, corresponding to a given clique, $C^*$, we
simply  sum  all  of  the  incoming  messages  with  the  local  component, 
as  shown  in  equation  2. 


$$BEL_{\downarrow C^*} = BEL_{C^*} \oplus BEL_{C^1 \Rightarrow C^*} \oplus \cdots \oplus BEL_{C^k \Rightarrow C^*} \qquad (2)$$

Figure  5.  Messages  passed  to  and  from  node  C* 


The  separation  property,  mentioned  in  Section  4,  assures  us 
that  the  marginalization  done  in  each  step  does  not  lose  any  infor¬ 
mation.  Thus  it  enables  calculation  over  smaller  frames,  at  consid¬ 
erable  savings  in  time.  That  this  procedure  is  correct  follows  from 
Kong[1986a, 1986b].

A  similar  procedure  can  be  used  for  sensitivity  analysis.  If  we 
modify one of the component belief functions, the modified node
in the tree of cliques simply passes a new set of messages outwards
to the others. The messages passed inwards toward the modified
clique  will  be  unchanged,  and  can  be  re-used.  The  same  margins 
are  examined  before  and  after  modification,  and  the  change  indi¬ 
cates  the  sensitivity  of  those  marginal  beliefs  to  the  assumption 
represented  in  the  changed  component. 

7.  Examining  the  Results 

The  output  of  the  fusion  and  propagation  algorithm  is  the  conflict, 
or  the  extent  to  which  the  belief  functions  are  inconsistent,  and  the 
marginal  belief  functions  for  each  clique  (or  singleton  element)  in 
the  tree.  For  the  Captain’s  Decision  example,  the  belief  function 
over Arrival Delay is of special interest. We will examine both the
conflict and the Arrival Delay here.

The total conflict is the mass that would be assigned to the null
set, if the direct sum operator did not renormalize the belief functions.
It varies from 0 to 1, and indicates to what extent the
component belief functions are inconsistent, or how much mass is
placed on contradictory cases. In this case the conflict is zero. Thus
conclusions  are  formed  from  reinforcing  positive  evidence,  rather 
than  by  ruling  out  contradictory  possibilities. 

Recall that Arrival Delay has seven possible values,
$\Theta_A = \{0, 1, 2, 3, 4, 5, 6\}$. Belief functions are defined over the power set
of the outcome space, so the belief function takes on 128 values.
One way of cutting down the information to a manageable size
is to only observe the focal elements, that is, the sets of possible
outcomes which are making a positive contribution to our belief
(those elements with a non-zero mass). There are 21 of these, as
shown in Table 1. The mass numbers represent support for the
true outcome being in a given set (focal element). In our example,
there is relatively small support for any given day, but there is large
support for focal elements representing large sets of days.

The raw focal elements are difficult to interpret, as is generally
true for non-binary variables. Let us try to make some meaningful
summaries of the results.

First, look at the lower and upper expectations for the arrival
time. Equation 3 gives the formula for calculating them.

$$E_*(A) = \sum_{B \subseteq \Theta_A} m(B) \min_{x \in B} x = 0.824, \qquad E^*(A) = \sum_{B \subseteq \Theta_A} m(B) \max_{x \in B} x \qquad (3)$$

That means that on average (in a series of hypothetical but never
realized trials), the ship will be between 1 and 2 days late.
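
The lower and upper expectations are straightforward to compute once the focal elements are listed: each focal element contributes its mass times its smallest (respectively largest) outcome. The sketch below uses a small hypothetical mass function over delays of 0, 1 or 2 days rather than the values of Table 1, and the function name is illustrative.

```python
def lower_upper_expectation(mass):
    """Return (lower, upper) expectations of a numeric-valued belief function.

    `mass` maps each focal element (a frozenset of numbers) to its mass.
    """
    lower = sum(w * min(b) for b, w in mass.items())
    upper = sum(w * max(b) for b, w in mass.items())
    return lower, upper

# Hypothetical mass function on delays of 0, 1 or 2 days.
DELAY_MASS = {
    frozenset({0}): 0.5,
    frozenset({1, 2}): 0.3,
    frozenset({0, 1, 2}): 0.2,
}

if __name__ == "__main__":
    low, high = lower_upper_expectation(DELAY_MASS)
    print(round(low, 3), round(high, 3))
```

For a Bayesian belief function every focal element is a singleton, so the two bounds coincide with the ordinary expectation.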


Mass    Focal Element       Mass    Focal Element
0.04    {0}                 0.01    {4}
0.07    {1}                 0.01    {4, 2}
0.16    {1, 0}              0.03    {4, 3}
0.04    {2}                 0.03    {4, 3, 2}
0.01    {2, 0}              0.07    {4, 3, 2, 1}
0.12    {2, 1}              0.04    {4, 3, 2, 1, 0}
0.09    {2, 1, 0}           0.01    {5, 4, 3, 2}
0.02    {3}                 0.01    {5, 4, 3, 2, 1}
0.02    {3, 1}              0.001   {5, 4, 3, 2, 1, 0}
0.06    {3, 2}              0.      {6, 5, 4, 3, 2, 1, 0} = Θ_A
0.04    {3, 2, 1}
0.10    {3, 2, 1, 0}

Table 1. Focal elements on Arrival delay

Obviously, looking at beliefs and plausibilities for all 128 subsets
of $\Theta_A$ would be exhausting, but looking at a carefully chosen
batch of those subsets could reduce the task considerably. One
such group of sets is the batch of singleton sets corresponding to
each day. Figure 6 shows graphically the beliefs and plausibilities for
these sets. The beliefs (lower probabilities) are the solid lines and
the plausibilities (upper probabilities) are the dotted lines.




Figure  6.  Single  Day  Beliefs  for  Arrival  delay 

Another group of interesting sets of arrival days is the batch of
propositions that the ship will arrive before a certain day, or after a
certain day. For example, {0, 1, 2} would represent less than 3 days
and {3, 4, 5, 6} would represent at least 3 days. These are shown
in Figures 7a and 7b. Because of the relationships between beliefs
and plausibilities, Figure 7a is the same as Figure 7b when turned
upside down and the dotted and solid lines are reversed.

Figure 7a. Fewer than n days for Arrival delay

Figure 7b. Greater than n days for Arrival delay
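
The mirror-image relationship between the two cumulative summaries follows from PL(A) being one minus the belief in the complement of A. A minimal sketch, using a hypothetical mass function over delays of 0 to 3 days rather than the values of Table 1, computes both summaries and makes the duality visible; all names are illustrative.

```python
def bel(mass, a):
    """Lower probability: mass of the focal elements contained in a."""
    a = frozenset(a)
    return sum(w for b, w in mass.items() if b <= a)

def pl(mass, a):
    """Upper probability: mass of the focal elements that intersect a."""
    a = frozenset(a)
    return sum(w for b, w in mass.items() if b & a)

DAYS = frozenset(range(4))
# Hypothetical mass function on delays of 0 to 3 days.
MASS = {
    frozenset({0}): 0.3,
    frozenset({1, 2}): 0.4,
    frozenset({2, 3}): 0.1,
    DAYS: 0.2,
}

def fewer_than(n):
    """The proposition that the delay is fewer than n days."""
    return frozenset(d for d in DAYS if d < n)

def at_least(n):
    """The complementary proposition: a delay of at least n days."""
    return DAYS - fewer_than(n)

if __name__ == "__main__":
    for n in range(5):
        print(n,
              round(bel(MASS, fewer_than(n)), 3), round(pl(MASS, fewer_than(n)), 3),
              round(bel(MASS, at_least(n)), 3), round(pl(MASS, at_least(n)), 3))
```

In each printed row the belief in "fewer than n" and the plausibility of "at least n" sum to one, which is why one plot is the other turned upside down with the solid and dotted lines exchanged.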

Work on useful summaries is still at a very preliminary stage.
The great flexibility that was an advantage in specifying the model
can become complexity in examining the results. Clever ways of
summarizing the resulting information then become necessary.

8. Conclusions and Future Directions

My experience with both the Captain's example, and the approximately
2,500 lines of LISP code that implement the general algorithms
described here, suggests that Graphical Belief Models can
successfully model uncertainty in practical decision problems. The
algorithms mentioned here are easy to code and to use. Sensitivity
analysis is simple in this system, and the theoretical framework is
very attractive.

The  simple  example  illustrates  how  to  partition  a  problem  into 
small  pieces  using  graphical  models,  and  how  to  model  various  re¬ 
lationships  among  the  attributes.  Re-arranging  the  model  hyper¬ 
graph  into  a  tree  of  cliques  allows  calculation  of  the  composite  belief 
function’s  margins  by  a  simple  message  passing  algorithm.  Finally, 
the simple belief function inputs give rise to the complex belief
function shown in Table 1. Although the calculation of BEL(A)
and PL(A) for any given $A \subseteq \Theta$ is simple, this belief function may
not be easy to comprehend and must be summarized. The methods
used in Section 7 are only a few of the ways we might interpret the
results of the example.

The system has a number of strengths:

1. The graphical modeling is simple to understand. It also forces
the user to be precise about the relationships among the attributes.
This may seem like a drawback to people who are
used to adding rules in a fast and loose manner to a PROLOG
database, but actually it forces one to model the interactions
correctly in the planning stage, rather than tune them by extensive
debugging. Of course sensitivity analysis methods can
be applied for fine tuning or in cases where there is uncertainty
about the model.

2.  The  belief  functions  are  a  flexible  tool  for  modeling  many  types 
of  relationships  among  the  variables  of  a  problem. 

3. The fusion and propagation algorithm lowers the computational
cost of calculation, making large problems tractable.

Unfortunately, the system also has some weaknesses. For large
outcome spaces, a general belief function (such as the one given
in Table 1) can be much more complex than a probability distribution
over the same space. Furthermore, belief functions are
unfamiliar objects and we do not have the wealth of methods for
interpretation and anecdotal experience that we do with probabilities.
The conflict that occurs in assembling the composite belief
function is almost certainly a useful tool for discovering what is
happening within a graphical model. On the other hand, we have
very little experience with evaluating what conflict means. Work
on the interpretation of belief functions is just beginning.

Although the fusion and propagation algorithm successfully
breaks modest sized problems into tractable pieces, a very large
problem, such as the fault tree analysis for a nuclear power plant,
will require new approaches. Supercomputing might help. The direct
sum operator is very amenable to vectorization, possibly taking
advantage of hypercubic notations for sets, while the message
passing algorithm described above would work well on a variable architecture
machine, such as the Connection Machine. Furthermore,
methods of modularizing graphs (breaking very large problems into
more modestly sized pieces on which the fusion and propagation
algorithm will work) to make large problems solvable on small machines
need to be explored.

The next stage in this research is to use the system to work
larger examples. This will help to develop both the theory and
the algorithms to accommodate the new examples. The methods
described here make belief function methodology accessible for
practical problems; its real potential has yet to be realized.

Variants of Tierney-Kadane

G. Weiss & H. A. Howlader

University of Winnipeg, Winnipeg, Manitoba

Abstract 

Bayes estimation of the reliability function of the logistic
distribution under a log-odds squared error loss with a non-informative
prior is considered by using the approximation method of Tierney &
Kadane (1986). Direct application of the procedure does not yield
correct results and so some variations of the procedure are considered.


1.  Introduction 

In  Bayesian  estimation  it  is  often  necessary  to 
evaluate  the  ratio  of  two  integrals  which  cannot  be 
expanded  into  closed-form  expressions.  Numerical 
approximation  of  the  ratio  of  the  integrals  is 
necessary.  Two  recent  procedures  to  achieve  this 
approximation  have  been  proposed  by  Lindley  (1980)  and 
by Tierney & Kadane (1986). The Lindley (1980)
procedure is fairly well-behaved and can be applied in
all situations. The procedure of Tierney & Kadane
(1986) works well in certain situations, such as
computing the posterior means of the parameters of a
probability distribution, and gives results which are
more  accurate  than  the  method  of  Lindley  (1980).  For 
approximations  from  small  samples  or  over  restricted 
spaces,  however,  the  Tierney  &  Kadane  (1986) 
procedure  can  be  very  erratic  and  may  lead  to 
incorrect  results.  See  Howlader  &  Weiss  (1987,  1988). 

In particular, the method does poorly in cases
where the integral in the numerator ranges over
positive and negative values. In this paper, we
consider one such instance, in the estimation of the
reliability function of a logistic distribution.

2. Bayes estimation

The logistic probability density function may be
written as

$$ f(x \mid \mu, \sigma) = \frac{c}{2\sigma}\,[1 + \cosh \xi]^{-1}, \tag{1} $$

where $\xi = c(x - \mu)/\sigma$, $c = \pi/\sqrt{3}$, $\mu$ is the mean, and $\sigma$ is the standard
deviation of the distribution.

Here, we consider the Bayesian estimation of the
reliability function,

$$ R_t = P(X > t) = \frac{1}{1 + e^{c(t - \mu)/\sigma}}, \tag{2} $$

by using the method of Tierney & Kadane (1986), using
the non-informative prior $p(\mu, \sigma) \propto 1/\sigma$. Combining this
prior density with the likelihood function,

$$ L(\mu, \sigma \mid x) = \left(\frac{c}{2\sigma}\right)^{n} \prod_{i=1}^{n} [1 + \cosh \xi_i]^{-1}, \tag{3} $$

the joint posterior density of $\mu$ and $\sigma$ is

$$ \pi(\mu, \sigma \mid x) \propto \sigma^{-(n+1)} \prod_{i=1}^{n} [1 + \cosh \xi_i]^{-1}, \tag{4} $$

where $\xi_i = c(x_i - \mu)/\sigma$.

For most statisticians, interested mainly in
controlling the amount of variability, it has become
standard practice to consider squared-error loss
functions. In the case of estimating a reliability, the
usual squared-error loss does not seem appropriate as
the reliability, which is a probability, is contained in
the closed interval [0, 1], and hence the 'distance' from
the true value is bounded. One remedy is to first
compute the log-odds ratio of the probability, which
maps the [0, 1] interval onto the entire real line. It
would thus be reasonable to use the squared-error of
the log-odds:

$$ \mathrm{Loss}(r_t, R_t) = \left[\log\frac{r_t}{1 - r_t} - \log\frac{R_t}{1 - R_t}\right]^{2}. \tag{5} $$

The Bayes estimator, $R_t^*$, of $R_t$ under this loss
function is the value of $r_t$ which minimizes the
posterior risk, $E[\mathrm{Loss}(r_t, R_t) \mid x]$, such that

$$ \log\frac{r_t}{1 - r_t} = E\left[\log\frac{R_t}{1 - R_t} \,\Big|\, x\right], \tag{6} $$

which gives

$$ R_t^* = \frac{1}{1 + e^{\delta^*}}, \qquad \delta^* = E[\delta \mid x]. \tag{7} $$

Then, from (2),

$$ \delta = \log\frac{1 - R_t}{R_t} = \frac{c(t - \mu)}{\sigma}. \tag{8} $$

Tierney & Kadane (1986) gave a method of
evaluation of the ratio of integrals, such as the
posterior mean of a function $u(\theta)$, which has the form


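$$ E[u(\theta) \mid x] = \frac{\int u(\theta)\,\pi(\theta \mid x)\,d\theta}{\int \pi(\theta \mid x)\,d\theta} \tag{9} $$

by writing

$$ \ell = \frac{1}{n}\log \pi(\theta \mid x) = \frac{\log p(\theta) + \log L(\theta \mid x)}{n}, $$

$$ \ell^* = \ell + \frac{1}{n}\log u(\theta) = \frac{\log u(\theta) + \log p(\theta) + \log L(\theta \mid x)}{n}. \tag{10} $$

Thus, (9) takes the form

$$ E[u(\theta) \mid x] = \frac{\int e^{n\ell^*}\,d\theta}{\int e^{n\ell}\,d\theta}. \tag{11} $$

Tierney & Kadane (1986) claim that their method
derives from an approximation method due to Laplace.
Whereas Lindley (1980), in a similar approximation,
expands both the numerator and the denominator of (11)
about a common point [the mle or posterior mode],
Tierney & Kadane (1986) expand each integral
separately about the point which maximizes the
integrand. This method requires only the first and the
second derivatives of the posterior density. Following
Tierney & Kadane (1986), the equation (11) in the
multiparameter case takes the approximate value

$$ E[u(\theta) \mid x] \approx \left(\frac{|\Sigma^*|}{|\Sigma|}\right)^{1/2} \exp\{n[\ell^*(\theta^*) - \ell(\hat\theta)]\}, \tag{12} $$

where $\theta^*$ and $\hat\theta$ maximize $\ell^*$ and $\ell$, respectively, and $\Sigma^*$
and $\Sigma$ are negatives of the inverse Hessians of $\ell^*$ and
$\ell$ at $\theta^*$ and $\hat\theta$, respectively.

To apply the method in the logistic case, we
need to maximize

$$ \ell = -\left(1 + \frac{1}{n}\right)\log\sigma - \frac{1}{n}\sum_{i}\log(1 + \cosh \xi_i), \tag{13} $$

and

$$ \ell^* = -\frac{1}{n}\log[1 + e^{\eta}] - \left(1 + \frac{1}{n}\right)\log\sigma - \frac{1}{n}\sum_{i}\log(1 + \cosh \xi_i), $$

where $\eta = c(t - \mu)/\sigma$. In what follows, $\varphi(\xi) = \sinh\xi/(1 + \cosh\xi)$ and $\gamma(\xi) = d\varphi/d\xi = 1/(1 + \cosh\xi)$.

Setting $\partial\ell/\partial\mu$ and $\partial\ell/\partial\sigma$ respectively to zero produces
the system

$$ \sum_i \varphi(\xi_i) = 0, \quad\text{and}\quad \sum_i \xi_i\,\varphi(\xi_i) = n + 1, \tag{14} $$

which produces the posterior mode. Also,

$$ \frac{\partial^2 \ell}{\partial\mu^2} = -\frac{c^2}{n\sigma^2}\sum_i \gamma(\xi_i), \qquad \frac{\partial^2 \ell}{\partial\mu\,\partial\sigma} = -\frac{c}{n\sigma^2}\sum_i \{\varphi(\xi_i) + \xi_i\gamma(\xi_i)\}, $$

$$ \frac{\partial^2 \ell}{\partial\sigma^2} = \frac{1}{n\sigma^2}\Big[(n + 1) - \sum_i \{2\xi_i\varphi(\xi_i) + \xi_i^2\gamma(\xi_i)\}\Big]. \tag{15} $$

Similarly, setting $\partial\ell^*/\partial\mu$ and $\partial\ell^*/\partial\sigma$ respectively to 0 gives

$$ \sum_i \varphi(\xi_i) = -E, \quad\text{and}\quad \sum_i \xi_i\,\varphi(\xi_i) = n + 1 - \eta E, \tag{16} $$

where $E = (1 + e^{-\eta})^{-1}$, and

$$ \frac{\partial^2 \ell^*}{\partial\mu^2} = \frac{c^2}{n\sigma^2}\Big\{-\tfrac{1}{2}\gamma(\eta) - \sum_i \gamma(\xi_i)\Big\}, \qquad \frac{\partial^2 \ell^*}{\partial\mu\,\partial\sigma} = -\frac{c}{n\sigma^2}\Big[\sum_i \{\varphi(\xi_i) + \xi_i\gamma(\xi_i)\} + E + \tfrac{1}{2}\eta\gamma(\eta)\Big], $$

$$ \frac{\partial^2 \ell^*}{\partial\sigma^2} = \frac{1}{n\sigma^2}\Big[(n + 1) - \sum_i \{2\xi_i\varphi(\xi_i) + \xi_i^2\gamma(\xi_i)\} - \{2\eta E + \tfrac{1}{2}\eta^2\gamma(\eta)\}\Big]. \tag{17} $$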
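As a rough illustration of the approximation (12), the following sketch applies it in one dimension to a case where the answer is known. Everything in it is an assumption made for illustration and not the authors' program: the helper names (`argmax_1d`, `second_derivative`, `tierney_kadane`), the golden-section optimizer, the finite-difference second derivative, and the test posterior (an unnormalized Gamma kernel with shape `a` and rate `b`, whose exact mean is `a / b`). The factor n is absorbed into the log posterior.

```python
import math


def argmax_1d(f, lo, hi, tol=1e-9):
    """Golden-section search for the maximizer of a unimodal f on [lo, hi]."""
    g = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    c = b - g * (b - a)
    d = a + g * (b - a)
    while b - a > tol:
        if f(c) > f(d):
            b, d = d, c
            c = b - g * (b - a)
        else:
            a, c = c, d
            d = a + g * (b - a)
    return (a + b) / 2.0


def second_derivative(f, x, h=1e-4):
    """Central finite-difference estimate of f''(x)."""
    return (f(x + h) - 2.0 * f(x) + f(x - h)) / (h * h)


def tierney_kadane(log_post, log_u, lo, hi):
    """One-dimensional form of (12): each integrand is expanded about its own maximizer."""
    def log_num(th):
        return log_post(th) + log_u(th)

    th_hat = argmax_1d(log_post, lo, hi)    # maximizer of the denominator integrand
    th_star = argmax_1d(log_num, lo, hi)    # maximizer of the numerator integrand
    var = -1.0 / second_derivative(log_post, th_hat)
    var_star = -1.0 / second_derivative(log_num, th_star)
    return math.sqrt(var_star / var) * math.exp(log_num(th_star) - log_post(th_hat))


# Hypothetical test case: Gamma(a, b) posterior kernel, u(theta) = theta.
a, b = 10.0, 2.0
estimate = tierney_kadane(lambda th: (a - 1.0) * math.log(th) - b * th,
                          math.log, 0.01, 50.0)
exact = a / b
```

Because `log_u` is the logarithm of u, this form needs u to be strictly positive on the search interval, which is the restriction discussed in the text.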

The procedure of Tierney & Kadane (1986) is
difficult to apply directly. The procedure requires
that the integrals in (11) be strictly positive, and the
procedure should only be applied when the integrals
are not near zero. In (8), $\delta$ will be positive if $R_t$ is
less than 1/2 (i.e. $t > \mu$); otherwise it is negative.

If the value of t is fixed at some value away
from the mean, $\mu$, then the integrand will be either
positive ($t > \mu$) or negative ($t < \mu$), almost surely. In
this case, the procedure can be applied by applying the
method to the positive integrand (taking absolute
value) and determining the sign afterwards. If the
value of t is fixed at some point near the mean,
however, this procedure will not work.

The estimators of the reliability function for
10,000 simulations of samples from (1) with $\mu = 25$ and
$\sigma = 5$ were computed and the histograms constructed.
Figure 1, the histogram for t = 15, shows the typical
distribution of an estimator about the true mean,
$R_{15} = 0.9741$. However, Figure 2, the histogram for
$t = \mu = 25$, shows a bi-modal distribution with the
estimates being pulled away from the true mean,
$R_{25} = 1/2$.

Although most apparent when $t = \mu$, there are
similar disturbances to the distributions of the
estimators of $R_t$ for other values of t also. In the following
sections several variations of the method are given.

3. Variant 1

One way to remedy the above situation is to
shift the value of the integrand by adding the
integrand to a large positive constant, $T \gg 0$, which is then
removed from the result (or subtracting, if $t < \mu$):

$$ \delta^* = \begin{cases} E[T + \delta \mid x] - T, & t > \mu, \\ T - E[T - \delta \mid x], & t < \mu. \end{cases} \tag{18} $$

This procedure will not alter the maxima of the
integrals and should be invariant with respect to the
size of the constant (as long as it is large enough).
However, since we are altering the value of the integrand
we may have slightly different computational precision
for different values of T.
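The effect of the shift can be checked on a toy case in which every step of the approximation is available in closed form. The sketch below is an illustration under assumptions, not the logistic computation of the paper: the posterior is taken to be normal with mean `m` and standard deviation `s`, the function of interest is u(θ) = θ (whose integrand changes sign when `m` is near zero), and the function name `tk_mean_with_shift` is hypothetical.

```python
import math


def tk_mean_with_shift(m, s, T):
    """Tierney-Kadane estimate of E[theta] for a N(m, s^2) posterior, as E[theta + T] - T."""
    # Denominator: log posterior -(theta - m)^2 / (2 s^2), maximized at theta = m (value 0).
    hess_den = -1.0 / s**2
    # Numerator adds log(theta + T); its maximizer solves (theta - m)(theta + T) = s^2.
    theta_star = (-(T - m) + math.sqrt((T + m) ** 2 + 4.0 * s**2)) / 2.0
    hess_num = -1.0 / s**2 - 1.0 / (theta_star + T) ** 2
    log_num = -((theta_star - m) ** 2) / (2.0 * s**2) + math.log(theta_star + T)
    ratio = math.sqrt(hess_den / hess_num)   # square root of the variance ratio
    return ratio * math.exp(log_num) - T     # remove the shift afterwards


# The posterior mean is 0.1, close to zero, so the unshifted integrand changes sign.
shifts = [50.0, 500.0, 5000.0]
estimates = [tk_mean_with_shift(0.1, 1.0, T) for T in shifts]
```

In this toy case the estimates agree closely with the true mean for all three shifts, which is the invariance claimed above for any sufficiently large constant.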

Tables I and II give the means and the mean
squared-errors (MSE) of the sampling distributions of
the reliability estimates for different values of T, the
shift parameter, for 1000 samples generated from a
logistic distribution as in (1) with $\mu = 25$ and $\sigma = 5$.

As expected, slightly differing histograms were
obtained for different values of the shift parameter;
however, such differences were minimal for any
reasonably large shift value (greater than 50). These


differences  were  virtually  eliminated  when  the 
convergence  criterion  for  the  iterative  procedure  was 
strengthened  from  t  =  10“^  to  «  =  10“*®. 

4.  Variant  2 

In this particular instance it is possible to
rewrite the expectation as a linear function of
expectations:

$$ \delta^* = c\,\{\,t\,E[1/\sigma \mid x] - E[\mu/\sigma \mid x]\,\}. \tag{19} $$

Now, the Tierney-Kadane procedure can be
applied to each of these expectations, since for $\mu$ not
near zero, each of the integrands will be away from
zero. Now, however, we are performing separate
approximations, and hence this variation of the
procedure is not numerically equivalent.

Although these variants of the procedure are
not numerically equivalent, they did produce very
similar histograms in the simulation studies. See
Figures 2-4. Compare Figure 4, which shows the
histogram for variant 1 with a shift parameter of 500,
and Figure 3, which shows variant 2, with the
histogram in Figure 2 obtained by direct application of
the method (identical to variant 1 but with shift of 0).


Table I

Means of distribution of T-K estimator for different shifts

T                        t = 15    t = 20    t = 25    t = 30    t = 35
0                        0.98791   0.91606   0.49198   0.08057   0.01190
50                       0.98377   0.89322   0.49216   0.10304   0.01595
100                      0.98362   0.89280   0.49217   0.10345   0.01610
500                      0.98349   0.89237   0.49218   0.10387   0.01623
5000                     0.98343   0.89214   0.49218   0.10410   0.01629
100000                   0.97060   0.85790   0.49274   0.13753   0.02885
100000 (strengthened ε)  0.98345   0.89288   0.49218   0.10396   0.01626
R_t                      0.9741    0.8598    0.5000    0.1402    0.0259

Table II

MSE's of the distribution of T-K estimator for different shifts

T                        t = 15     t = 20     t = 25     t = 30     t = 35
0                        .0004710   .0078547   .0536627   .0077600   .0004717
50                       .0005580   .0079320   .0288334   .0080704   .0005525
100                      .0005595   .0079287   .0286412   .0080680   .0005539
500                      .0005615   .0079177   .0284799   .0080577   .0005558
5000                     .0005625   .0079181   .0283601   .0080589   .0005569
100000                   .0009700   .0077780   .01%678   .0078457   .000950C
100000 (strengthened ε)  .0005618   .0079143   .0284491   .0080543   .0005561
SE(MSE)*                 .000018    .000122    .000356    .000103    .000015

(*) These are the standard errors of the MSE, which is approximately
constant for all values of the shift parameter, except for a shift of 0
and for the shift of 100000 with ε = 10“® for which the values are
slightly different.

[Figure 1: standard T-K under LOL]

[Figure 2: standard T-K under LOL]


The first variant, which is the technically more
correct method, does require a separate optimization of
the numerator for each value of t (as in the standard
T-K procedure). This second variant, however, is
easier to apply and requires the optimization of only
two numerators which can then be used to estimate the
reliability function for all t. [Both variants also
require the determination of the posterior mode, which
is used to evaluate the normalizing integral in the
denominator, which is, of course, common to all of the
expectations].

What we would like is a method that has the
advantages of both variants.


5. 'Variant' 3

Another variation of the procedure of Tierney
& Kadane (1986) is suggested by the result in (19). In
(19), we are estimating the reliability function (2) by
replacing $\mu/\sigma$ and $1/\sigma$ in the kernel of the reliability
function by their respective posterior means. Again,
since these cannot be obtained directly, each is
approximated using the method of Tierney & Kadane
(1986).

Although this variation of the Tierney &
Kadane (1986) procedure is only valid in this particular
situation, it does suggest a method that might be
generally applicable whenever we wish to approximate
the Bayes estimator of a function of the parameters of
the distribution.

The suggestion is to simply use the posterior
means of the parameters themselves ($\mu$ and $\sigma$) and use
these values in place of the unknown parameters.
Again it may be necessary to compute these posterior
means using the standard procedure of Tierney &
Kadane (1986), or by direct computation when possible.
This procedure is not really a variant of the Tierney
& Kadane (1986) procedure, nor is it even a true Bayes procedure
in that the estimator is not the minimum of the
posterior expectation (risk) of some loss function.

Recall that, for squared-error loss, the minimum
posterior risk estimator (i.e., Bayes estimator) is

$$ \tilde{R}_t = E(R_t \mid x), \tag{20} $$

while the minimum posterior risk estimator for log-odds
squared-error loss is

$$ R_t^* = \frac{1}{1 + e^{\,c\,(t\,E[1/\sigma \mid x] - E[\mu/\sigma \mid x])}}. \tag{21} $$

We propose the estimator

$$ \hat{R}_t = \frac{1}{1 + e^{\,c\,(t - E[\mu \mid x])/E[\sigma \mid x]}}, \tag{22} $$

which very likely will not be the minimum of the
posterior expectation of any loss function, and is thus
not a true Bayes estimator.
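To make the difference between (20), (21) and (22) concrete, the sketch below evaluates all three from a set of posterior draws of $(\mu, \sigma)$. This is only an illustration under assumptions: the paper approximates the posterior expectations by the method of Tierney & Kadane (1986), whereas here they are replaced by plain averages over hypothetical draws, and the names `reliability` and `three_estimators` are introduced for the example.

```python
import math

C = math.pi / math.sqrt(3.0)   # the constant c of the logistic density


def reliability(t, mu, sigma):
    """Reliability function R_t = 1 / (1 + exp(c (t - mu) / sigma))."""
    return 1.0 / (1.0 + math.exp(C * (t - mu) / sigma))


def three_estimators(t, draws):
    """Return the estimators (20), (21), (22) from draws [(mu, sigma), ...].

    The draws stand in for the posterior; averages over them replace the
    Tierney-Kadane approximations used in the paper (an assumption of this sketch).
    """
    n = len(draws)
    e_r = sum(reliability(t, m, s) for m, s in draws) / n     # E[R_t | x]
    e_inv = sum(1.0 / s for _, s in draws) / n                # E[1/sigma | x]
    e_ratio = sum(m / s for m, s in draws) / n                # E[mu/sigma | x]
    e_mu = sum(m for m, _ in draws) / n                       # E[mu | x]
    e_sigma = sum(s for _, s in draws) / n                    # E[sigma | x]
    squared_error = e_r                                                   # (20)
    log_odds = 1.0 / (1.0 + math.exp(C * (t * e_inv - e_ratio)))          # (21)
    plug_in = reliability(t, e_mu, e_sigma)                               # (22)
    return squared_error, log_odds, plug_in


# Hypothetical draws scattered around mu = 25, sigma = 5.
draws = [(24.0, 4.5), (25.0, 5.0), (26.5, 5.5), (25.5, 6.0)]
se_est, lo_est, plug_est = three_estimators(20.0, draws)
```

When all draws coincide the three estimators are identical; they separate only through the spread of the posterior, which is consistent with the small differences reported between the histograms.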

We generated the 10,000 samples of size 10 and
constructed the histogram, shown in Figure 5, for the
estimator in (22), for which the posterior expectations
were approximated using the method of Tierney &
Kadane (1986). Compare this distribution with the
histograms for variants 1 and 2 shown in Figures 3 and
4 and also with the histograms for the Tierney &
Kadane (1986) estimator under squared-error loss in
Figure 6. There is not a very great difference between
any of these. Hence, although (22) may not be a true
Bayes estimator, as an approximation to a Bayes
estimator it seems to be a justifiable alternative.

Also, the form (22) is intuitively attractive in
that it suggests that, as in the case of the mle, for
estimating a function $g(\theta)$, we can use the Bayes
estimators of $\theta$ in place of $\theta$ in the functional form.
As well, the method is generally applicable and requires
a minimum number of optimizations.

Another comparison of these three variants is
obtained by generating the estimated reliability curves
for a single sample. A sample of 20 observations from
the logistic distribution in (1) with $\mu = 25$ and $\sigma = 5$
is obtained as

10.372   20.570   21.204   21.540   24.118
24.256   24.325   24.357   25.301   25.344
26.661   26.881   26.989   27.377   29.110
29.450   29.888   31.849   31.946   37.473

Figure  7  shows  the  estimated  reliability  curves 
for  the  three  variants  under  the  log-odds  loss, 
together  with  the  curve  for  the  true  reliability  and 
for  the  Tierney-Kadane  estimator  under  squared-error 
loss.  There  again  appears  to  be  little  difference 
between  the  four  estimates.  In  particular,  the  curve 
for  variant  3  is  almost  identical  to  that  of  variant  1. 


NUMERICAL APPROACH TO NON-GAUSSIAN SMOOTHING
AND ITS APPLICATIONS


Genshiro Kitagawa, The Institute of Statistical Mathematics


Abstract 

A smoothing methodology for the analysis of time series
is shown. The method is based on the general state
space model which is expressed by conditional distributions.
Various types of non-Gaussian models, nonlinear
models and discrete variate models can be handled with
this generic model. Recursive formulas for the prediction,
filtering and smoothing of state are given for this general
state space model. Unlike the familiar linear Gaussian
state space model for which these formulas can be realized
by simple Kalman filter and the fixed interval smoother,
these formulas are implemented by using numerical
expressions for related distributions. It thus becomes a
computationally intensive method but is very flexible and is
a useful tool for the analysis of time series that have been
difficult to handle by the standard time series models.
Many numerical examples are shown to illustrate the
usefulness of the general state space modeling and of general
smoothing algorithm.

1.  Introduction 

In time series modeling, we used to consider parametric
models. But with the spread of the application of the
time series models, the limitation of the usual parametric
models has been recognized and in some situations like
seasonal adjustment, models with very many parameters
were required. But obviously in such a situation, the
ordinary maximum likelihood method does not work since
it sometimes involves the estimation of more parameters
than the number of observations.

For such a situation, penalized likelihood method might
be applied. In this approach, the crucial problem is the
selection of the tradeoff parameter $\lambda$. But there is a
Bayesian interpretation of the problem based on smoothness
prior, and we can determine the tradeoff parameter
by maximizing the likelihood of the Bayesian model
(Akaike, 1980). This smoothness prior method gave a
solution to the many parameter problems, but this usually
involves solution of linear equation with very large
dimension. The use of the state space model mitigates this
computational burden and makes the large parametric
model practical.

In the modeling of nonstationary time series, the main
issue is the representation of time varying system. Linear
Gaussian state space models are very useful for the
modeling of gradual changes of parameters. For example,
seasonal adjustment problem and estimation of changing
spectrum can be treated with these linear Gaussian models.
But these linear Gaussian models are not so adequate
for the sudden changes or jump of parameters. And we
need another prior that allows sudden changes as well as
gradual changes.

Such a prior can be well realized by the use of
non-Gaussian state space model. But this non-Gaussian state
space model can be further extended to a general state
space model which can handle very wide situation including
nonlinear and discrete distributions. In this paper, we
derive recursive formulas of the prediction, filtering and
smoothing for this general state space model. Unlike the
familiar linear Gaussian state space model for which these
formulas can be realized by simple Kalman filter and the
fixed interval smoother, these formulas are implemented
by using numerical expressions for related distributions.
It thus becomes a computationally intensive method but
is very flexible and is a useful tool for the analysis of time
series that have been difficult to handle by the standard
time series models. Many numerical examples are shown
to illustrate the usefulness of the general state space
modeling and of general smoothing algorithm.

2.  General  State  Space  Model  and  State 
Estimation 

Consider a system described by a general state space
model

$$ x_n \sim q(\,\cdot \mid x_{n-1}), \qquad y_n \sim r(\,\cdot \mid x_n), \tag{1} $$

where $y_n$ is the observation and $x_n$ is the unknown state
vector. q and r are conditional distributions of $x_n$ given
$x_{n-1}$ and of $y_n$ given $x_n$, respectively. The initial state
vector $x_0$ is distributed according to the distribution $p(x_0 \mid Y_0)$.
The set of observations and the states, $Y_m$ and $X_m$, are
defined by $Y_m \equiv \{y_1, \ldots, y_m\}$ and $X_m \equiv \{x_1, \ldots, x_m\}$. The
conditional distribution of $x_n$ given $X_k$ and $Y_m$ is
denoted by $p(x_n \mid X_k, Y_m)$. The problem of state estimation
can be formulated as the evaluation of $p(x_n \mid Y_m)$, the
conditional distribution of $x_n$ given observations $Y_m$. For
n > m, n = m and n < m, this formulates the problems
of prediction, filtering and smoothing, respectively.

The above general state space model (1) implicitly
assumes the following Markov properties:

$$ p(x_n \mid x_{n-1}, Y_{n-1}) = p(x_n \mid x_{n-1}), \qquad p(y_n \mid x_n, Y_{n-1}) = p(y_n \mid x_n). \tag{2} $$

Obviously, our general state space model includes the
ordinary linear state space model

$$ x_n = F x_{n-1} + G v_n, \qquad y_n = H x_n + w_n, \tag{3} $$

with Gaussian white noises $v_n$ and $w_n$.

Under the assumption of (2), it can be shown that the
conditional distribution satisfies

$$ p(x_n \mid x_{n+1}, Y_N) = p(x_n \mid x_{n+1}, Y_n). \tag{4} $$

In Kitagawa (1980, 1987), it was shown that for the general
state space model with (2) and (4), the recursive formulas
for obtaining one step ahead prediction, filtering and
smoothing distributions are (in continuous variate case)
given as follows:

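One step ahead prediction:

$$ p(x_n \mid Y_{n-1}) = \int_{-\infty}^{\infty} p(x_n, x_{n-1} \mid Y_{n-1})\,dx_{n-1} = \int_{-\infty}^{\infty} p(x_n \mid x_{n-1}, Y_{n-1})\,p(x_{n-1} \mid Y_{n-1})\,dx_{n-1} = \int_{-\infty}^{\infty} q(x_n \mid x_{n-1})\,p(x_{n-1} \mid Y_{n-1})\,dx_{n-1} \tag{5} $$

Filtering:

$$ p(x_n \mid Y_n) = p(x_n \mid y_n, Y_{n-1}) = \frac{p(y_n \mid x_n, Y_{n-1})\,p(x_n \mid Y_{n-1})}{p(y_n \mid Y_{n-1})} = \frac{r(y_n \mid x_n)\,p(x_n \mid Y_{n-1})}{p(y_n \mid Y_{n-1})} \tag{6} $$

where $p(y_n \mid Y_{n-1})$ is obtained by $\int r(y_n \mid x_n)\,p(x_n \mid Y_{n-1})\,dx_n$.

Smoothing:

$$ p(x_n \mid Y_N) = \int_{-\infty}^{\infty} p(x_n, x_{n+1} \mid Y_N)\,dx_{n+1} = \int_{-\infty}^{\infty} p(x_{n+1} \mid Y_N)\,p(x_n \mid x_{n+1}, Y_N)\,dx_{n+1} = \int_{-\infty}^{\infty} p(x_{n+1} \mid Y_N)\,p(x_n \mid x_{n+1}, Y_n)\,dx_{n+1} $$

$$ \phantom{p(x_n \mid Y_N)} = p(x_n \mid Y_n) \int_{-\infty}^{\infty} \frac{p(x_{n+1} \mid Y_N)\,p(x_{n+1} \mid x_n, Y_n)}{p(x_{n+1} \mid Y_n)}\,dx_{n+1} = p(x_n \mid Y_n) \int_{-\infty}^{\infty} \frac{p(x_{n+1} \mid Y_N)\,q(x_{n+1} \mid x_n)}{p(x_{n+1} \mid Y_n)}\,dx_{n+1}. \tag{7} $$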

These formulas (5), (6) and (7) show recursive relation
between state distributions. In the linear Gaussian
case, the conditional distributions $p(x_n \mid Y_{n-1})$, $p(x_n \mid Y_n)$
and $p(x_n \mid Y_N)$ are characterized by the means and the
covariance matrices and (5), (6) and (7) thus are equivalent
to the standard Kalman filter and the fixed interval
smoothing algorithms. In the general case, however,
the conditional distribution of the state $p(x_n \mid Y_m)$ becomes
non-Gaussian and cannot be specified by using only the
first two moments. It thus becomes necessary to use a
numerical method for the realization of the formulas. This
point will be considered in the next section.

The general state space model usually has some unknown
parameters $\theta$. The best values of the parameters can
be found by maximizing the log likelihood defined by

$$ \ell(\theta) = \log p(y_1, \ldots, y_N) = \sum_{n=1}^{N} \log p(y_n \mid y_1, \ldots, y_{n-1}) = \sum_{n=1}^{N} \log p(y_n \mid Y_{n-1}). \tag{8} $$

Here each $p(y_n \mid Y_{n-1})$ is the quantity that appeared in (6).

If we have several candidate models, the goodness of
the model can be evaluated by the value of AIC defined
by

$$ \mathrm{AIC} = -2 \max \ell(\theta) + 2\,(\text{number of parameters}). \tag{9} $$

Thus the best choice of the model can be made by looking
for the one with the smallest value of AIC.


3. Numerical Implementations of the General
Smoothing Formulas


In this section, we will show numerical methods for
implementing the formulas. For discrete distributions, the
implementation is easy. Therefore in this section, we will
assume that each distribution has a density function.

3.1 Numerical Approximation to the Densities

In typical situation, the filtering and smoothing formulas
can be implemented by using the following operations:

• nonlinear transformation of state

• convolution of two densities

• Bayes formula

• normalization

These operations can be realized by using numerical
approximation to the densities. In Kitagawa (1987), each
density function was approximated by a continuous piecewise
linear (first order spline) function. Here we will show
a simple method based on step-function approximation.
Each function is expressed by the number of segments, k,
location of nodes, $x_i$ (i = 1, ..., k), and the value of the
density at each segment, $p_i$ (i = 0, ..., k). Specifically, we
use the following expressions: $p(x_n \mid Y_{n-1}) \sim \{k, x_i, p_{ni}\}$,
$p(x_n \mid Y_n) \sim \{k, x_i, f_{ni}\}$, $q(x) \sim \{k_q, x_{qi}, q_i\}$,
$r(x) \sim \{k_r, x_{ri}, r_i\}$. We denote these functions
by $p_n(x)$, $f_n(x)$, $q(x)$ and $r(x)$, respectively.
For simplicity we assume that the nodes are equally spaced
and that $\Delta x = x_i - x_{i-1}$.

• Convolution

Consider the convolution of two densities $q(x)$ and
$f_{n-1}(x)$. This can be done by:

$$ p_{ni} = p_n(x_i) = \int_{-\infty}^{\infty} q(x_i - y)\,f_{n-1}(y)\,dy = \sum_{j=1}^{k} \int_{x_{j-1}}^{x_j} q(x_i - y)\,f_{n-1}(y)\,dy \approx \Delta x \sum_{j=1}^{k} q_{ij}\,f_{n-1,j}. \tag{10} $$

Here $q_{ij}$ is the value of q at the node $x_{ij}$, where ij satisfies $x_{ij} = x_i - x_j$.

• Nonlinear Transformation of State

Assume that the density of x is given as $f(x)$ and that
we consider the density, $h(y)$, of $y = g(x)$. If $g(x)$ is a
monotone function with an inverse, then $h(y)$ is obtained
by $f(g^{-1}(y))\,|dg^{-1}/dy|$. But in general, $h(y)$ can
be evaluated numerically by the following algorithm:

For  1  =  1  to  k 

•  yn  =  min{g(x,_.  ),  y(x,)} 

•  ya  =  max{y(x,-i  ),y(x,)} 


•  In  = 


W,  =  ^  +1 


•  for  j  =  ( 0  +  1  to  ,'| 

-  yi  =  inax-{y(,,.r,„  +  {j  -  l)A,i  ) 

-  y-i  =  min{y:i,x„  +;Ax} 

-  h.  =  h,  + 

• Bayes formula

Given $r(x)$ and the predictive density $p_n(x)$, $f_{ni}$ (i =
0, 1, ..., k) is obtained by

$$ f_{ni} = f_n(x_i) = \frac{p_n(x_i)\,r(y - h(x_i))}{C} = \frac{p_{ni}\,r_i}{C}. \tag{11} $$

Here y is the given observation at that stage and $r_i =
r(y - h(x_i))$ can be evaluated directly from the function
$r(w)$. In (11), C is the normalizing constant given below.

• Normalization

$$ C = \int_{-\infty}^{\infty} p_n(x)\,r(y - h(x))\,dx = \sum_{i=1}^{k} \int_{x_{i-1}}^{x_i} p_n(x)\,r(y - h(x))\,dx \approx \Delta x \sum_{i=1}^{k} p_{ni}\,r_i. \tag{12} $$

This normalizing constant C can be used for the
computation of likelihood given in (9).
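The operations above can be put together in a few lines for the simplest case. The sketch below is a minimal illustration under stated assumptions, not the program used in the paper: the state is scalar, the model is a random walk $x_n = x_{n-1} + v_n$, $y_n = x_n + w_n$ (so h(x) = x), densities are stored as values on an equally spaced grid, and the function names (`gauss`, `predict`, `update`, `smooth`, `run`, `mean`) and the test data are hypothetical. Gaussian noise is used only so that the result can be compared with the Kalman filter; any density function could be passed for `q` and `r`.

```python
import math


def gauss(sd):
    """Zero-mean Gaussian density with standard deviation sd."""
    return lambda u: math.exp(-0.5 * (u / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))


def predict(f_prev, q, xs, dx):
    """Prediction by convolution: p_n(x_i) ~ dx * sum_j q(x_i - x_j) f_{n-1}(x_j)."""
    return [dx * sum(q(xi - xj) * fj for xj, fj in zip(xs, f_prev)) for xi in xs]


def update(p, r, y, xs, dx):
    """Bayes formula and normalization: f_i = p_i r(y - x_i) / C, C ~ dx * sum_i p_i r_i."""
    w = [pi * r(y - xi) for pi, xi in zip(p, xs)]
    c = dx * sum(w)
    return [wi / c for wi in w], c


def smooth(f, p_next, s_next, q, xs, dx):
    """Smoothing: s_n(x_i) = f_n(x_i) * dx * sum_j s_{n+1}(x_j) q(x_j - x_i) / p_{n+1}(x_j)."""
    ratio = [sj / pj if pj > 0.0 else 0.0 for sj, pj in zip(s_next, p_next)]
    return [fi * dx * sum(q(xj - xi) * rj for xj, rj in zip(xs, ratio))
            for xi, fi in zip(xs, f)]


def run(ys, q, r, f0, xs, dx):
    """Forward filtering pass, then backward smoothing pass; also returns the log likelihood."""
    preds, filts, loglik = [], [], 0.0
    f = f0
    for y in ys:
        p = predict(f, q, xs, dx)
        f, c = update(p, r, y, xs, dx)
        preds.append(p)
        filts.append(f)
        loglik += math.log(c)           # C approximates p(y_n | Y_{n-1})
    smooths = [None] * len(ys)
    smooths[-1] = filts[-1]
    for n in range(len(ys) - 2, -1, -1):
        smooths[n] = smooth(filts[n], preds[n + 1], smooths[n + 1], q, xs, dx)
    return filts, smooths, loglik


def mean(dens, xs, dx):
    """Mean of a density stored on the grid."""
    return dx * sum(x * d for x, d in zip(xs, dens))


k = 251
xs = [-10.0 + 20.0 * i / (k - 1) for i in range(k)]
dx = xs[1] - xs[0]
f0 = [gauss(2.0)(x) for x in xs]        # hypothetical initial density
ys = [0.5, 1.0, 0.8]                    # hypothetical observations
filts, smooths, loglik = run(ys, gauss(0.5), gauss(1.0), f0, xs, dx)
```

The prediction step costs on the order of k squared operations per time point, which is the convolution cost that dominates the computing time in this kind of implementation.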

3.2 Implementation by FFT

In the above implementation, most of the computing
time is spent on the convolution that appears in the
prediction formula. This computation can be significantly
reduced by using the FFT algorithm based on the following
diagram:

$$ \begin{array}{ccc} f(x),\ q(x) & \longrightarrow & p(x) = \int q(y)\,f(x - y)\,dy \\ \downarrow \mathcal{F} & & \uparrow \mathcal{F}^{-1} \\ F(\omega),\ Q(\omega) & \longrightarrow & P(\omega) = F(\omega)\,Q(\omega) \end{array} \tag{13} $$

3.3 Gaussian Sum Approximation

In the case of linear state space model with densities,
another way of implementing the non-Gaussian filter is to
use Gaussian sum (mixture) approximations to the densities.
In this method each density is approximated by a
Gaussian sum:

$$ q(x_n \mid x_{n-1}) = \sum_{i=1}^{m_q} \alpha_i\,\varphi_i(x_n \mid x_{n-1}), \qquad r(y_n \mid x_n) = \sum_{j=1}^{m_r} \beta_j\,\varphi_j(y_n \mid x_n), $$

$$ p(x_n \mid Y_{n-1}) = \sum_{k=1}^{m_{pn}} \gamma_{kn}\,\varphi_k(x_n \mid Y_{n-1}), \tag{14} $$

$$ p(x_n \mid Y_n) = \sum_{l=1}^{m_{fn}} \delta_{ln}\,\varphi_l(x_n \mid Y_n), \tag{15} $$

where each $\varphi_i$ is a Gaussian density with appropriate
mean and covariance matrix.

Using this approximation, the formulas for prediction
and filtering are obtained as follows:

$$ p(x_n \mid Y_{n-1}) = \sum_{i=1}^{m_q} \sum_{j=1}^{m_{f,n-1}} \alpha_i\,\delta_{j,n-1}\,\varphi_{ij}(x_n \mid Y_{n-1}) \equiv \sum_{k=1}^{m_{pn}} \gamma_{kn}\,\varphi_k(x_n \mid Y_{n-1}), \tag{16} $$

$$ p(x_n \mid Y_n) \propto \sum_{j=1}^{m_r} \sum_{k=1}^{m_{pn}} \beta_j\,\gamma_{kn}\,\varphi_{jk}(y_n \mid Y_{n-1})\,\varphi_{jk}(x_n \mid Y_n) \equiv \sum_{l=1}^{m_{fn}} \delta_{ln}\,\varphi_l(x_n \mid Y_n). \tag{17} $$

Here $\equiv$ means the reordering, $\gamma_{kn} = \alpha_i\,\delta_{j,n-1}$ (for some k),
$\delta_{ln} \propto \beta_j\,\gamma_{kn}\,\varphi_{jk}(y_n \mid Y_{n-1})$ (for some l) and $\varphi_{ij}$ and $\varphi_{jk}$ are
obtained by the Kalman filter.

Technical difficulties with this method are as follows:

• The number of necessary Gaussian terms increases
very rapidly to infinity, e.g., $m_{fn} = m_{f0}\,(m_q \cdot m_r)^n$.

• The smoothing formula cannot be directly realized
by this method.

The first difficulty can be mitigated by reducing the number
of Gaussian components at each step of filtering.
Approximate fixed interval smoother can be obtained by
fixed lag smoothing which is simply realized by augmenting
the state vector.

4.  Numerical  Examples 

We will show various applications of general state
space modeling and the general smoothing. The first
example of 4.1 and 4.4 are taken from Kitagawa (1987).

4.1 Estimation of Mean Value Function

We consider the estimation of the mean value function
of the data. We use a simple first order trend model

$$ \Delta t_n = v_n, \qquad y_n = t_n + w_n. \tag{18} $$

Here $\Delta$ is the difference operator defined by $\Delta t_n = t_n -
t_{n-1}$ and $v_n$ and $w_n$ are white noise sequences that are not
necessarily Gaussian. We consider the following model
class:

$$ \text{Model}(b):\quad q(v_n) = C\,(\tau^2 + v_n^2)^{-b}, \qquad r(w_n) = (2\pi\sigma^2)^{-1/2} \exp\left\{-\frac{w_n^2}{2\sigma^2}\right\}, \tag{19} $$

where

$$ C = \frac{\tau^{2b-1}\,\Gamma(b)}{\Gamma(1/2)\,\Gamma(b - 1/2)}. $$

The maximum likelihood estimates of $\tau^2$ and $\sigma^2$ for the
Gaussian model, Model($\infty$), were $\tau^2 = 0.0122$, $\sigma^2 = 1.043$
and the AIC of the model was 1503.0.3. Fig. 1.1 shows the
posterior densities of $t_n$ obtained by the Gaussian model.
It can be seen that the posterior mean is drifting with the
time and does not reflect the jump of the mean value.

[Fig. 1.1 Posterior density of $t_n$ obtained by the Gaussian model (Kitagawa 1987)]

[Fig. 1.2 Posterior density of $t_n$ obtained by the non-Gaussian model (Kitagawa 1987)]

Among the class of non-Gaussian models, the minimum
of AIC (MST.SOl was attained at b = 0.75, $\tau^2$ =
2.2 X 10"-, $\sigma^2$ = 1.022. Fig. 1.2 shows the posterior density
of $t_n$ obtained by this non-Gaussian model. Comparing
with Fig. 1.1, it becomes clear that the non-Gaussian
model, Model(0.75), has better ability to reproduce the
jump of the mean level automatically.


[Fig. 1.3 Seasonal data, estimated trend, seasonal and noise components by a Gaussian model]


•Non-Gaussian seasonal adjustment

This method of trend estimation can be extended to
seasonal adjustment. The state space model for the
seasonal adjustment is given in Kitagawa and Gersch (1984).
But here neither system noise $v_n$ nor observational noise
$w_n$ are assumed to be Gaussian. Since the state dimension
of the seasonal adjustment model is large, we used a
Gaussian sum approximation. Fig. 1.3 shows the quarterly
record of the increase of inventories of private companies
in Japan (1965 to 1983). Also shown are the trend,
seasonal and the noise components estimated by a Gaussian
model. The estimated trend is too smooth and the
seasonal component changes gradually. On the other hand,
Fig. 1.4 shows the results by non-Gaussian model. We can
see that the trend jumps up and down around 1973-1974
when the oil crisis took place. The seasonal pattern
also changed significantly during this time period.

[Fig. 1.4 Estimated trend, seasonal and noise components by a non-Gaussian model]

4.2  Fstimatioii  of  Changing  \7ariance 

We consider the estimation of the changing variance of a nonstationary
time series. Assume that the time series $y_1, \ldots, y_N$ is an
independent Gaussian sequence with time varying variance $\sigma_n^2$.
Then the transformed series $s_m$ defined by

$$s_m = \log\left\{ (y_{2m-1}^2 + y_{2m}^2)/2 \right\} \qquad (21)$$

has the property that $s_m - \log \sigma_{2m}^2$ is an independent random
variable distributed as the logarithm of an exponential distribution.
Therefore, we can estimate the log-variance by using the model

$$\Delta^k t_m = v_m,$$
$$s_m = t_m + w_m, \qquad (22)$$

with $r(w) = \exp\{ w - e^{w} \}$.
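The transformation behind model (22) can be checked numerically. The following is a minimal sketch, assuming the pairwise form $s_m = \log\{(y_{2m-1}^2 + y_{2m}^2)/2\}$ and a constant variance for the simulated data; the function names and the simulation settings are illustrative and are not taken from the paper.

```python
import numpy as np

def log_pair_transform(y):
    """Return s_m = log((y_{2m-1}^2 + y_{2m}^2) / 2) for consecutive pairs."""
    y = np.asarray(y, dtype=float)
    n_pairs = len(y) // 2
    pairs = y[: 2 * n_pairs].reshape(n_pairs, 2)
    return np.log(np.mean(pairs ** 2, axis=1))

def log_exponential_density(w):
    """Density r(w) = exp(w - e^w) of the log of a unit exponential variable."""
    w = np.asarray(w, dtype=float)
    return np.exp(w - np.exp(w))

# Hypothetical test data: Gaussian noise with constant variance sigma2.
rng = np.random.default_rng(0)
sigma2 = 4.0
y = rng.normal(0.0, np.sqrt(sigma2), size=200_000)
w_samples = log_pair_transform(y) - np.log(sigma2)

# Riemann sums on a wide grid; r(w) is negligible at both ends.
grid = np.linspace(-30.0, 5.0, 35_001)
dx = grid[1] - grid[0]
r = log_exponential_density(grid)
total_mass = np.sum(r) * dx
mean_w = np.sum(grid * r) * dx
```

The observation noise $w_m$ is therefore skewed and has a nonzero mean, which is why a non-Gaussian filter is used here instead of treating $w_m$ as Gaussian.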

•Smoothing periodogram

Since the periodogram is distributed as an exponential distribution,
as a natural application of this method, we can smooth the
log-periodogram by the above method. In spline smoothing, Wahba (1980)
approximated the density r(w) by a Gaussian distribution with the same
first and second moments as those of r(w). But here, by our method, we
can smooth the series without using a Gaussian approximation. Fig. 2
shows the log-periodogram and the smoothed log-periodogram obtained by
the AIC best model, e.g., the second order trend model (k = 2) with
Cauchy system noise input.


[...] and [...] are Cauchy and [...] is Gaussian. We can see that the
arrivals of the P and S waves are clearly detected by this method.


4.3  Estimation  of  Changing  Spectrum 
It is well known that the coefficients of an AR model can be estimated
recursively by obtaining partial autocorrelation coefficients. We extend
this method to the nonstationary AR model

$$y_n = \sum_{i=1}^{K} a_{i,n}\, y_{n-i} + w_n. \qquad (23)$$

The models for smoothing time varying partial autocorrelation
coefficients are

[Equation (24) is illegible in the source.]

where the two error terms are respectively the forward and backward
prediction errors of the autoregressive model of order k. In the
stationary case, $v_n = u_n = 0$, and the two sets of coefficients are
then identical to the partial autocorrelation coefficients. The
distribution of the noise inputs may be either Gaussian or non-Gaussian.
After estimating the time-varying AR coefficients by the smoothing
method, we can estimate the instantaneous spectrum of the nonstationary
process by

$$p_n(f) = \frac{\sigma^2}{\left| 1 - \sum_{j=1}^{K} a_{j,n}\, e^{-2\pi i j f} \right|^2}. \qquad (25)$$
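The instantaneous spectrum can be evaluated directly once time-varying AR coefficients are available. The sketch below assumes the standard AR spectral form $\sigma^2 / |1 - \sum_j a_{j,n} e^{-2\pi i j f}|^2$ with frequencies in cycles per sample; the function names and the coefficient layout (one row per time point) are hypothetical.

```python
import numpy as np

def ar_spectrum(coeffs, sigma2, freqs):
    """Power spectrum of an AR model y_n = sum_j a_j y_{n-j} + w_n.

    coeffs : AR coefficients a_1..a_K at one time point
    sigma2 : innovation variance
    freqs  : frequencies in cycles per sample (0 <= f <= 0.5)
    """
    coeffs = np.asarray(coeffs, dtype=float)
    freqs = np.asarray(freqs, dtype=float)
    lags = np.arange(1, len(coeffs) + 1)
    # transfer(f) = 1 - sum_j a_j exp(-2 pi i j f)
    transfer = 1.0 - np.exp(-2j * np.pi * np.outer(freqs, lags)) @ coeffs
    return sigma2 / np.abs(transfer) ** 2

def instantaneous_spectra(coeff_path, sigma2, freqs):
    """Apply ar_spectrum row by row to an (n_times, K) coefficient array."""
    return np.array([ar_spectrum(row, sigma2, freqs) for row in coeff_path])

# Hypothetical example: an AR(1) coefficient drifting from 0.5 to -0.5.
freqs = np.array([0.0, 0.25, 0.5])
path = np.array([[0.5], [0.0], [-0.5]])
spectra = instantaneous_spectra(path, 1.0, freqs)
```

Each row of `spectra` is the spectrum at one time point, so plotting the rows against time gives a changing-spectrum display of the kind described for the seismic data.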

Fig. 3 shows the estimated changing spectrum of a seismic data. The AR
coefficients are estimated by assuming that


4.4  Inhomogeneous  Discrete  Process 

The general state space model can be applied to the estimation of the
time-varying mean of discrete distributions. We consider the number of
rainy days over 1mm in Tokyo for each day during 1983-84. The problem is
to estimate the probability, $p_n$, of occurrence of rainfall on a
specific calendar day, which is believed to be gradually changing with
time.

We estimated the probability of rainfall by the following model:

$$\Delta^k q_n = v_n,$$
$$P(m_n \mid p_n, l_n) = \binom{l_n}{m_n} p_n^{m_n} (1 - p_n)^{l_n - m_n}. \qquad (26)$$

Here $q_n = \log\{p_n/(1 - p_n)\}$, $l_n$ is the number of observations at
the n-th day, $m_n$ the number of rainy days and $p_n$ the time dependent
mean of the binomial distribution. The estimated rainfall probability
for Tokyo obtained by this method is shown in Fig. 4.

•Nonstationary Poisson Process

The same method can be used for the estimation of the time-varying mean
of the Poisson distribution. This problem occurs in the analysis of
X-ray stars. The main problem is the analysis of the quasi-periodicity
of the time fluctuation of the Poisson mean observed at a satellite.
But for some stars, the time constant is very short (e.g.,
milliseconds), and the signal change of the mean is only a few percent
of the mean level (and the variance), so the series looks like one with
a very low signal to noise ratio. For such a series, this method can be
used to extract the signal.


Fig. 3 Estimated changing spectrum of a seismic data.

Fig. 4 Estimated rainfall probability for Tokyo (Kitagawa 1987).
4.5  Quasi Periodic Process

The famous Wolf sunspot number data exhibits the approximate repetition
of a pattern, but both the period and the amplitude are not so definite
and change gradually. This type of phenomenon can be seen in ecological
data (e.g., the Canadian lynx data) and many varieties of air pollution
data. Although such series are frequently modeled by AR, ARMA or AR plus
sinusoidal models, none of those models seems quite satisfactory for
prediction with more than one lead time. For a time series with quasi
periodic character, by using a model

[Equation (27) is only partly legible in the source. It combines
difference-equation smoothness priors for the phase and the
log-amplitude with an observation equation made of sine and cosine
terms plus observation noise.]

we can estimate the phase and amplitude of the model. Fig. 5.1 shows the
sunspot number data, the estimated phase, log-amplitude and the cyclic
function of the model.

To check the prediction ability, we consider a simple model with
constant amplitude:

[Equation (28) is only partly legible in the source. It specifies an
observation of the form $y_n = \cos(\theta_n) + w_n$, a phase
$\theta_n$ driven by a second order autoregressive state $x_n$, and
Gaussian noise inputs.]

160 observations were generated with this model. The parameters of the
model and the state are estimated by using the first 120 observations.
The second order trend model was the AIC best. Fig. 5.2 shows the
increasing horizon prediction of the state $x_n$ (or $\theta_n$) and of
the observation $y_n$ obtained by

$$p(x_{n+j} \mid Y_n) = \int p(x_{n+j} \mid x_{n+j-1})\, p(x_{n+j-1} \mid Y_n)\, dx_{n+j-1}$$

$$p(y_{n+j} \mid Y_n) = \int p(y_{n+j} \mid x_{n+j})\, p(x_{n+j} \mid Y_n)\, dx_{n+j} \qquad (29)$$
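The first recursion in (29) can be carried out by direct numerical integration on a grid. The sketch below is an illustration only: it assumes a scalar random-walk state, so that the transition density depends on $x_{n+j} - x_{n+j-1}$ alone, and a uniform grid; the function names are hypothetical and this is not the authors' implementation.

```python
import numpy as np

def gaussian_pdf(z, var):
    """Zero-mean Gaussian density with variance var."""
    return np.exp(-z ** 2 / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)

def cauchy_pdf(z, scale):
    """Zero-mean Cauchy density, a heavy-tailed alternative system noise."""
    return scale / (np.pi * (z ** 2 + scale ** 2))

def predict_on_grid(density, grid, transition_pdf, steps):
    """Iterate p(x_{n+j}|Y_n) = int p(x_{n+j}|x_{n+j-1}) p(x_{n+j-1}|Y_n) dx.

    density        : p(x_n | Y_n) evaluated on the uniform grid
    transition_pdf : density of x_{n+j} - x_{n+j-1} (random-walk assumption)
    Returns an array whose row j holds p(x_{n+j} | Y_n).
    """
    dx = grid[1] - grid[0]
    # kernel[i, k] = p(x_{n+j} = grid[i] | x_{n+j-1} = grid[k])
    kernel = transition_pdf(grid[:, None] - grid[None, :])
    out = [np.asarray(density, dtype=float)]
    for _ in range(steps):
        out.append(kernel @ out[-1] * dx)
    return np.array(out)

grid = np.linspace(-15.0, 15.0, 1501)
dx = grid[1] - grid[0]
start = gaussian_pdf(grid, 1.0)
pred = predict_on_grid(start, grid, lambda z: gaussian_pdf(z, 0.25), steps=4)
mass = pred.sum(axis=1) * dx
variance = (grid ** 2 * pred).sum(axis=1) * dx
```

With Gaussian inputs the predictive variance grows linearly with the horizon, which gives a simple sanity check; passing `lambda z: cauchy_pdf(z, scale)` instead shows how a heavy-tailed system noise changes the predictive densities.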


Fig. 5.1 Sunspot number data, estimated phase, amplitude and cyclic
function of the model.


Fig. 5.2 Predictive density $p(x_{n+j} \mid Y_n)$.


hand side of the figure show the results by our nonlinear filter and
smoother. These figures show a typical situation where these two
algorithms yield quite different results.

Fig. 6.3 shows the posterior density $p(x_n \mid Y_N)$ obtained by the
extended Kalman filter based linearized smoother. It can be seen that
the estimates have a tendency to diverge and do not satisfactorily
reproduce the true signal shown in Fig. 6.1. On the other hand, Fig. 6.4
shows the results by our nonlinear smoother. We can see that remarkably
good results were obtained by our smoother.


Fig. 6.1 Simulated $x_n$, which is assumed to be unknown, and the
observations.


Fig. 5.3 Predictive density of [...]


(Panels: Kalman filter; non-Gaussian smoother.)


4.6  Nonlinear smoothing

We consider the data artificially generated by the following model,
which was originally used by Andrade Netto et al. (1978) and considered
in the rejoinder of Kitagawa (1987):

$$x_n = \frac{1}{2} x_{n-1} + \frac{25\, x_{n-1}}{1 + x_{n-1}^2} + 8 \cos(1.2 n) + v_n,$$
$$y_n = \frac{x_n^2}{20} + w_n. \qquad (30)$$
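Data of this kind are easy to simulate. The sketch below assumes the state equation $x_n = x_{n-1}/2 + 25 x_{n-1}/(1 + x_{n-1}^2) + 8\cos(1.2 n) + v_n$ and the quadratic observation $y_n = x_n^2/20 + w_n$ commonly associated with the Andrade Netto et al. example. The noise standard deviations and the initial state are not legible in the source, so the values used below are placeholders.

```python
import numpy as np

def simulate(n_obs, x0, state_sd, obs_sd, rng):
    """Simulate (x_n, y_n), n = 1..n_obs, from the nonlinear model.

    state_sd and obs_sd are the standard deviations of v_n and w_n;
    both are assumptions, since the source does not show them.
    """
    x = np.empty(n_obs)
    y = np.empty(n_obs)
    prev = x0
    for n in range(1, n_obs + 1):
        prev = (0.5 * prev
                + 25.0 * prev / (1.0 + prev ** 2)
                + 8.0 * np.cos(1.2 * n)
                + rng.normal(0.0, state_sd))
        x[n - 1] = prev
        # The observation depends on x_n only through x_n**2, so the sign
        # of the state cannot be read off a single observation.
        y[n - 1] = prev ** 2 / 20.0 + rng.normal(0.0, obs_sd)
    return x, y

rng = np.random.default_rng(1)
x, y = simulate(100, x0=0.0, state_sd=1.0, obs_sd=1.0, rng=rng)
```

Because $y_n$ carries no direct information on the sign of $x_n$, the posterior density of the state can be bimodal, which is the situation where a filter based on a single Gaussian approximation is likely to struggle.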


The data is shown in Fig. 6.1. The problem is to estimate the true
signal $x_n$ from the sequence of observations $\{y_n\}$, assuming that
the model (30) is known. Our nonlinear filter and smoother were applied
to this problem. For comparison, the well known extended Kalman filter
was also applied. Fig. 6.2 shows the posterior densities
$p(x_m \mid Y_m)$ for selected values of $m$. The densities shown in
the left hand side of the figure are obtained by the extended Kalman
filter and the linearized fixed interval smoother (Sage and Melsa
1971). The ones in the right


Fig. 6.2 Posterior densities of $p(x_m \mid Y_m)$ for selected $m$,
obtained by the extended Kalman filter and the non-Gaussian smoother.


Fig. 6.3 Posterior density of $p(x_n \mid Y_N)$ obtained by the
extended Kalman filter based smoother.


Fig. 6.4 Posterior density of $p(x_n \mid Y_N)$ obtained by the
non-Gaussian smoother.


The second example of nonlinear smoothing is a passive receiver
problem. A similar problem was considered by Bucy and Senne (1970). In
this example, the target is gradually moving on the two dimensional
space. Fig. 6.5 shows an example of this trajectory. This target is
observed by the scalar nonlinear measurement function

$$y_n = h(x_n^1, x_n^2) + w_n \qquad (31)$$

where

[The definition of $h$ in equation (32) is illegible in the source.]

$$\theta_n = \theta_0 + n \Delta\theta. \qquad (32)$$

Here $\theta_0$ and $\Delta\theta$ are given constants and $w_n$ is a
Gaussian white noise with known variance $\sigma^2$. This is a simple
example of a vector tracking problem of a moving object by observing
the relative angle observed on a rotating observatory. Fig. 6.6 shows
an example of $y_n$ which is generated by the model (31). For the
estimation of this moving object, we consider the following simple
smoothness prior

[Equation (33) is illegible in the source.]

Here $v_n^1$ and $v_n^2$ are mutually independent Gaussian white noise
sequences with variances $\tau_1^2$ and $\tau_2^2$, respectively. The
smoothness prior model (33) with the observation model (31) constitute
our nonlinear state space model for estimating the location of the
object. It should be noted that the Gaussianity of neither $v_n$ nor
$w_n$ is necessary in our model. The values of $\tau_1^2$ and
$\tau_2^2$ are unknown but can be estimated by maximizing the
log-likelihood defined by (8). Fig. 6.7 shows the contour of the
posterior density $p(x_n \mid Y_N)$ for n = 20, 40, 60 and 80.


Fig. 6.5 Trajectory of [...]


Fig. 6.6 Observed $y_n$


6.  CONCLUDING  REMARKS 

By the use of the general non-Gaussian state space model, we can treat
various types of time series. Recursive filtering and smoothing
formulas can be derived for this generic state space model. The direct
numerical method for non-Gaussian filtering and smoothing is practical
at least for lower dimensional problems.

For higher order systems, however, it involves intensive computations.
The most significant part of the computing time is spent for the
convolution. The amount of computation for the convolution is roughly
of the order of [expression illegible in the source], where k is the
number of segments, m is the state dimension and N is the data length.
For a rough idea, the CPU times spent for the examples in Sections 4.1
(m = 1) and 4.6 (m = 2 and 4) are [illegible] seconds and [illegible]
seconds, respectively, on a [illegible] MIPS computer. Obviously, for
higher order systems, we need additional effort that will include the
use of a supercomputer, the development of special algorithms or
hardware for fast convolution, integration or FFT, and the reduction of
the number of necessary operations based on spline or Gaussian sum
approximations.

In our model, it is assumed that the noise distribution is nonsingular
and the conditional distribution $p(x_n \mid x_{n-1})$ is well-defined.
Ansley and Kohn (1987) pointed out that this is not the case for some
important problems. The necessary modification for such a case is shown
in that article.
1.  Introduction

Interior point methods for linear programming problems are certainly
not new. Many people have been intrigued by the notion of going through
the polytope rather than proceeding from vertex to vertex around the
polytope as required by the simplex method. This idea received fresh
impetus from the startling announcements of Karmarkar [Kar84], who
claimed that his new interior point method based on projective methods
solved a certain large linear programming problem 50 times faster than
the simplex method. Furthermore, this algorithm was provably polynomial
in the number of operations required. Thus for the first time, there
was a polynomial algorithm for linear programming that actually held
the promise of outperforming the simplex method. Since then, there have
been many studies related to Karmarkar's method and renewed interest in
other interior point methods such as the barrier function method and
Huard's method of centers.

In this paper we present computationally efficient interior point
methods based on Huard's method of centers. We confirm the need to keep
iterates close to the center trajectory of the polytope in order to get
good performance, and we derive and present numerical results for two
specific multi-directional procedures. The best of our procedures is
based on solving a two-dimensional linear programming problem at each
step. This method compares very favorably with other recent interior
point procedures reported in the literature.

In the presentation that follows, we take the linear programming
problem to be in the form

$$\min_u\ c^T u \quad \text{subject to} \quad A u \le b, \qquad (1.1)$$

where $c, u \in R^n$, $A \in R^{m \times n}$, and $b \in R^m$. Although
we assume that the problem is bounded and that $A$ has full column
rank, it is not necessary to assume that the constraints have a full
dimensional interior since the big-M procedure used here to find an
initial feasible point will always have one. In that case, the Phase I
solution will be the optimal solution.

The remainder of this paper is organized as follows. In §2 we give a
description of Huard's original method of centers, and we consider some
generalizations. In particular, we show that smooth trajectories exist
that connect any initial feasible point to an optimal solution. Some of
these trajectories, however, get arbitrarily close to an exponential
number of vertices, and we argue that recentering is therefore
desirable. In §3 we derive two specific algorithms that incorporate a
recentering strategy. Finally, in §4 we give some numerical results
that show the promise of our approach. The details of the methods and
results presented here are contained in [BDDW88]; additional
theoretical developments are in [WBD88].

2.  The  Method  of  Centers 

In this section we describe the method of centers and show how to
obtain a smooth trajectory rather than a sequence of points. We
generalize these results by posing an initial value problem in ordinary
differential equations whose solutions are trajectories that connect
any feasible initial point to an optimal solution. The problem of
trajectories that get arbitrarily close to an exponential number of
vertices is then examined. We overcome this problem without sacrificing
the essential properties of the original system by deriving a modified
differential equation that has a recentering component.

The notation required to describe the method of centers is defined as
follows. Let the set of residuals corresponding to the constraints of
(1.1) be

$$r_k(u) = b_k - a_k^T u, \quad k = 1, \ldots, m,$$

where $a_k^T$ denotes the k-th row of $A$. Note that if $r_k(u) > 0$,
$k = 1, \ldots, m$, then $u$ is a feasible point. Next, define a
residual corresponding to the objective function

$$r_0(u, t) = t - c^T u.$$

Here $t$ is a scalar variable that is meant to correspond to a previous
value of the objective function. In particular, let $u_0$ be a feasible
point and let

$$t_0 = c^T u_0.$$

Then if $r_0(u, t_0) > 0$, $u$ yields a lower objective function value
than $u_0$, i.e., $c^T u < c^T u_0$.

The center of the polytope defined by the constraints of (1.1) and the
objective constraint for $t = t_0$ is the feasible point, $u_1$, that
solves

$$\max_u\ \log r_0(u, t_0) + \sum_{k=1}^{m} \log r_k(u).$$
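A center of this kind can be computed with a damped Newton iteration on the logarithmic objective. The sketch below is a minimal illustration on a small polytope, not the implementation used in the paper; the function names, the backtracking rule and the test problem (the unit square with $c = (1, 1)^T$ and $t_0 = 2$) are assumptions made for the example.

```python
import numpy as np

def barrier(A, b, c, t, u):
    """L(u, t) = log(t - c^T u) + sum_k log(b_k - a_k^T u); -inf if infeasible."""
    r0 = t - c @ u
    r = b - A @ u
    if r0 <= 0 or np.any(r <= 0):
        return -np.inf
    return np.log(r0) + np.sum(np.log(r))

def center(A, b, c, t, u, iters=50, tol=1e-12):
    """Maximize L(., t) from a strictly feasible u by damped Newton steps."""
    u = np.asarray(u, dtype=float)
    for _ in range(iters):
        r0 = t - c @ u
        r = b - A @ u
        grad = -c / r0 - A.T @ (1.0 / r)
        if np.linalg.norm(grad) < tol:
            break
        # Negative Hessian of L: positive definite when A has full column rank.
        neg_hess = np.outer(c, c) / r0 ** 2 + A.T @ (A / r[:, None] ** 2)
        step = np.linalg.solve(neg_hess, grad)  # Newton ascent step
        alpha = 1.0
        # Halve the step until it stays feasible and does not decrease L.
        while (barrier(A, b, c, t, u + alpha * step) < barrier(A, b, c, t, u)
               and alpha > 1e-12):
            alpha *= 0.5
        u = u + alpha * step
    return u

# Unit square 0 <= u <= 1 written as A u <= b.
A = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]])
b = np.array([1.0, 1.0, 0.0, 0.0])
c = np.array([1.0, 1.0])
u1 = center(A, b, c, t=2.0, u=[0.5, 0.5])
```

For this example the stationarity condition can be solved by hand: by symmetry both coordinates are equal and satisfy $1/x = 1.5/(1 - x)$, so the center is $(0.4, 0.4)$, pulled away from the middle of the square by the objective constraint.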


Now set

$$t_1 = c^T u_1$$

and define $u_2$ as the solution to

$$\max_u\ \log r_0(u, t_1) + \sum_{k=1}^{m} \log r_k(u).$$

Continuing this process, a sequence of iterates $\{u_i\}$ is obtained.
It can be shown that $\{u_i\}$ converges to an optimal solution as
$i \to \infty$. This procedure is Huard's original method of centers
[Hua67] applied to the linear programming problem. An implementation of
this method was shown by Renegar [Ren86] to possess an equivalent
polynomial complexity bound to that of Karmarkar's original method
[Kar84].

It is easy to see that by continuously moving the constraint
corresponding to the objective function, one obtains a continuous
trajectory rather than a set of points. Every point on that trajectory
can be viewed as a function of $t$. Specifically, let

$$L(u, t) = \log r_0(u, t) + \sum_{k=1}^{m} \log r_k(u).$$

Then for any value of $t$, $u(t)$ satisfies

$$\nabla_u L(u, t) = 0. \qquad (2.1)$$


By differentiating (2.1) with respect to $t$, an expression is obtained
for the change in $u(t)$ as a function of $t$, i.e.,

$$\nabla_{uu} L(u, t)\, u'(t) + \nabla_{ut} L(u, t) = 0,$$

or

$$u'(t) = -\nabla_{uu} L(u, t)^{-1}\, \nabla_{ut} L(u, t). \qquad (2.2)$$


While the above differential equation characterizes the trajectory, an
initial condition needs to be supplied to complete the specification.
We consider therefore

$$u'(t) = -\nabla_{uu} L(u, t)^{-1}\, \nabla_{ut} L(u, t),$$
$$u(t_0) = \bar{u},$$
$$t_0 = c^T \bar{u} + \varepsilon, \quad \varepsilon > 0. \qquad (2.3)$$


By taking $\bar{u} = u_1$ from above and $\varepsilon = c^T u_0 - c^T u_1$
we obtain the desired trajectory. It is of interest, however, to
consider any initial feasible point $\bar{u}$ and any $\varepsilon > 0$,
and to assess the solution to (2.3). In general, for any such $\bar{u}$
and $\varepsilon$,

$$g = \nabla_u L(\bar{u}, t_0)$$

is not equal to zero. Thus, if we require that along a trajectory
$u_g(t)$,

$$\nabla_u L(u_g(t), t) = g,$$

then $u_g(t)$ satisfies (2.2) and hence the initial value problem
(2.3). When $g = 0$ the resulting trajectory is referred to as the
center trajectory, while if $g \ne 0$ the trajectory is called an
off-center trajectory. The theoretical properties of these trajectories
are contained in [WBD88].

Computing the actual derivatives of $L$ and substituting these in (2.2)
yields

$$u'(t) = \left[ A^T D^2 A + \frac{c c^T}{(t - c^T u)^2} \right]^{-1} \frac{c}{(t - c^T u)^2}, \qquad (2.4)$$

where $D$ is defined as

$$D = \mathrm{diag}\{ 1/r_k(u),\ k = 1, \ldots, m \}.$$

For

$$X = \mathrm{diag}\left\{ \frac{(t - c^T u)^2}{r_k(u)^2},\ k = 1, \ldots, m \right\},$$

(2.4) can be rewritten as

$$u'(t) = [A^T X A + c c^T]^{-1} c. \qquad (2.5)$$

Applying the Sherman-Morrison-Woodbury formula to the matrix in (2.5)
results in

$$u'(t) = \eta\, (A^T X A)^{-1} c, \qquad (2.6)$$

where $\eta$ is a scalar. A numerical procedure for solving linear
programming problems can then be obtained by numerically integrating
(2.5) or (2.6). The use of Euler's method, for example, yields a
direction that is known as the dual affine direction [ARV86].
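The step from (2.5) to (2.6) can be verified numerically. The sketch below assumes that $X$ is the diagonal matrix with entries $(t - c^T u)^2 / r_k(u)^2$; under that assumption the Sherman-Morrison formula gives the scalar explicitly as $\eta = 1/(1 + c^T (A^T X A)^{-1} c)$. The function name and the test problem are illustrative.

```python
import numpy as np

def trajectory_tangent(A, b, c, t, u):
    """Return the right-hand sides of (2.5) and (2.6) and the scalar eta."""
    r = b - A @ u                      # constraint residuals r_k(u)
    r0 = t - c @ u                     # objective residual r_0(u, t)
    x_diag = (r0 / r) ** 2             # diagonal of X (assumed scaling)
    M = A.T @ (A * x_diag[:, None])    # A^T X A
    full = np.linalg.solve(M + np.outer(c, c), c)   # (2.5)
    reduced = np.linalg.solve(M, c)                 # direction in (2.6)
    eta = 1.0 / (1.0 + c @ reduced)                 # Sherman-Morrison scalar
    return full, reduced, eta

# Unit square 0 <= u <= 1 written as A u <= b, with an interior point.
A = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]])
b = np.array([1.0, 1.0, 0.0, 0.0])
c = np.array([1.0, 2.0])
u = np.array([0.3, 0.6])
full, reduced, eta = trajectory_tangent(A, b, c, t=c @ u + 0.5, u=u)
```

Since $\eta$ only rescales the vector, a line search along $(A^T X A)^{-1} c$ explores the same line as one along the tangent of the trajectory, which is consistent with the remark that Euler's method leads to the dual affine direction.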

In Witzgall et al. [WBD88], we show that all trajectories converge to a
single optimal solution, even when the optimal solutions are not
unique. In the case of a single optimal solution at a vertex, all of
the trajectories converge to that vertex, and the tangents of these
trajectories similarly converge.

As we discuss in [BDDW88], there are paths that stay arbitrarily close
to the boundary of the polytope (see also [MS86]). This is corroborated
by the fact that $\nabla_u L(u(t), t)$ stays constant: if it is large
initially, which it will be if $u_0$ is close to the boundary, then it
must stay large thereafter. Thus the dual affine direction can produce
long paths [MS86] in the polytope, i.e., paths that visit an
exponential number of vertices. These paths mirror the long paths
exhibited by the simplex method, and it follows that poor performance
is possible.

Recall however that the center trajectory is defined by the condition
that $L(u, t)$ be maximized as a function of $u$. The natural method of
solving this optimization problem is Newton's method, where the step is
given by

$$-\nabla_{uu} L(u, t)^{-1}\, \nabla_u L(u, t). \qquad (2.7)$$
We can incorporate this direction into the differential equation (2.2)
to obtain

$$u'(t) = -\nabla_{uu} L(u, t)^{-1} \left[ \nabla_{ut} L(u, t) - \phi\, \nabla_u L(u, t) \right], \qquad (2.8)$$

for an arbitrary positive constant $\phi$. The sign appears to be
wrong, but recall that the integration is backwards from $t_0$ to the
optimal value $t^* < t_0$.

It would seem that (2.8) would not satisfy any condition such as (2.1).
By differentiating

$$\nabla_u L(u(t), t) = g\, e^{\phi (t - t_0)}$$

with respect to $t$, however, and solving for $u'(t)$ as before, we
obtain (2.8). Thus, along any solution to (2.8), the value of
$\|\nabla_u L\|$ decreases. It does not, however, go to 0 since the
upper limit of integration is $t^*$ and not $\infty$. The amount of
recentering correction can also be made to vary with $t$. If

$$\Phi(t) = \int_{t_0}^{t} \phi(s)\, ds,$$

then deriving the differential equation with respect to $t$ using

$$\nabla_u L(u(t), t) = g\, e^{\Phi(t)}$$

yields (2.8), but with $\phi$ replaced by $\phi(t)$.
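The decay of the gradient along a solution of (2.8) can be observed numerically. The sketch below integrates (2.8) backwards in $t$ with the classical fourth order Runge-Kutta method from an off-center starting point and compares the final gradient with $g\, e^{\phi (t - t_0)}$. The integrator, the test polytope and the value of $\phi$ are assumptions made for the illustration.

```python
import numpy as np

def grad_L(A, b, c, u, t):
    """Gradient in u of L(u, t) = log(t - c^T u) + sum_k log(b_k - a_k^T u)."""
    return -c / (t - c @ u) - A.T @ (1.0 / (b - A @ u))

def recentered_rhs(A, b, c, u, t, phi):
    """Right-hand side of (2.8): -L_uu^{-1} [L_ut - phi * L_u]."""
    r0 = t - c @ u
    r = b - A @ u
    neg_hess = np.outer(c, c) / r0 ** 2 + A.T @ (A / r[:, None] ** 2)  # -L_uu
    l_ut = c / r0 ** 2                                                 # L_ut
    return np.linalg.solve(neg_hess, l_ut - phi * grad_L(A, b, c, u, t))

def integrate_backwards(A, b, c, u, t0, t1, phi, steps=400):
    """Classical RK4 from t0 down to t1 < t0 (the step h is negative)."""
    h = (t1 - t0) / steps
    t = t0
    for _ in range(steps):
        k1 = recentered_rhs(A, b, c, u, t, phi)
        k2 = recentered_rhs(A, b, c, u + 0.5 * h * k1, t + 0.5 * h, phi)
        k3 = recentered_rhs(A, b, c, u + 0.5 * h * k2, t + 0.5 * h, phi)
        k4 = recentered_rhs(A, b, c, u + h * k3, t + h, phi)
        u = u + (h / 6.0) * (k1 + 2.0 * k2 + 2.0 * k3 + k4)
        t += h
    return u

# Unit square 0 <= u <= 1 written as A u <= b, off-center start.
A = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]])
b = np.array([1.0, 1.0, 0.0, 0.0])
c = np.array([1.0, 1.0])
u_start = np.array([0.3, 0.6])
t0 = c @ u_start + 0.5          # epsilon = 0.5
t1 = t0 - 0.2
phi = 2.0
g = grad_L(A, b, c, u_start, t0)
u_end = integrate_backwards(A, b, c, u_start, t0, t1, phi)
g_end = grad_L(A, b, c, u_end, t1)
```

Setting `phi = 0.0` recovers (2.2), in which case the gradient stays equal to `g` along the whole path; this is the constancy that allows off-center trajectories to remain close to the boundary.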

The theoretical properties of (2.8) are contained in [WBD88]. Note
however, that while recentering is intuitively appealing when far from
the optimal vertex, it may actually slow final convergence. In [BDDW88]
we observe that a negative component in the recentering direction often
yields a better direction as the iterates approach the optimal vertex.
Further discussion on this point is included in §3, which explores
numerical algorithms based on the search directions derived here.

3.  New Algorithms

In the previous section we motivated the choice of the dual affine
search direction by using Euler's method for ordinary differential
equations (ODE). Different search directions could be generated by
considering other methods for the numerical integration of initial
value problems. Since the precise determination of the actual
trajectory from an initial feasible point to the solution is not of
interest here, the ODE analysis is used only to suggest search
directions; other considerations dictate the distance to travel in
those directions.

For example, a typical dual affine procedure uses a steplength that is
a large percentage of the distance to the boundary of the polytope and
thus does not attempt to follow a single trajectory. One can easily see
that this type of algorithm might perform poorly. While the current
estimate might be on a “well-behaved” trajectory, the next iterate
might be on a trajectory that gets arbitrarily close to an exponential
number of vertices of the polytope.
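The steplength rule mentioned above, a fixed percentage of the distance to the boundary, amounts to a ratio test over the constraints that tighten along the search direction. The sketch below is a minimal illustration; the function name and the default fraction of 0.99 are assumptions.

```python
import numpy as np

def step_to_boundary(A, b, u, s, fraction=0.99):
    """Largest alpha with A(u + alpha s) <= b, scaled by `fraction`.

    u must be strictly feasible (b - A u > 0). Returns inf when no
    constraint is approached along s.
    """
    r = b - A @ u              # residuals, positive in the interior
    rate = A @ s               # rate at which each residual shrinks
    tightening = rate > 0
    if not np.any(tightening):
        return np.inf
    return fraction * np.min(r[tightening] / rate[tightening])

# Unit square 0 <= u <= 1 written as A u <= b.
A = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]])
b = np.array([1.0, 1.0, 0.0, 0.0])
u = np.array([0.5, 0.5])
alpha = step_to_boundary(A, b, u, s=np.array([-1.0, 0.0]))
u_next = u + alpha * np.array([-1.0, 0.0])
```

A step of this size leaves the new iterate very close to a face of the polytope, which is exactly the situation in which the next trajectory may hug the boundary.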


Long paths can be avoided by including a recentering component in the
search direction. Such an approach aims at keeping the iterates more
interior to the polytope, which helps maintain nonsingularity in
$A^T D^2 A$. Various strategies for combining a recentering direction
with the dual affine direction are possible. Such “multi-direction”
methods are discussed in the remainder of this section.

A multi-direction method attempts to combine a cost improvement
direction, $s_c$, with a recentering direction, $s_r$, so that the
iterates remain sufficiently close to the center trajectory and thus do
not display the convergence problems described above. The value of
multi-directional search procedures is well appreciated in the recent
work involving interior point methods, e.g., [Kar85]. Closer analysis
by Gonzaga [Gon87] of various standard interior point approaches
demonstrated that each consisted of two basic directions: a cost
improvement direction, and a recentering direction. In this study, we
consider three different multi-directional search approaches using both
the standard search directions and new ones. These approaches are
called the composite method, the two-step method, and the
two-dimensional subspace method.

The composite method is conceptually the least complicated of the
multi-direction methods. Using this method, the two component
directions are combined at each iteration to form a single direction

$$s = s_c + \phi(t)\, s_r$$

as suggested by (2.8), where $\phi(t)$ is a weight that determines the
contribution of each component to the combined direction.
Unfortunately, we have not been able to find a value of $\phi(t)$ that
performs consistently well in practice.

In addition, the selection of an appropriate steplength for the
composite direction depends heavily on the value of $\phi(t)$ selected.
Projecting a search direction heavily dominated by a recentering
component to within 99% of the boundary won't necessarily yield an
improved trajectory. Likewise, attempting to recenter using a quadratic
line search with a direction dominated by the dual affine search
direction is not practical, since the slope of the quadratic model of
$\nabla_u L(u, t)$ will be approximately zero (see §2). Because of
these problems, the composite method is not discussed further.

The two-step method uses the two component directions independently. At
each iteration, a cost improvement step is followed immediately by a
recentering step. This eliminates the need to explicitly specify the
weight $\phi(t)$ as required by the composite method. The two-step
method also eliminates the problem associated with the composite method
of selecting an appropriate steplength. Since the steps in the cost
improvement and recentering directions are made independently, the
steplength of each step can also be specified separately. This method,
described further in §3.1, is found to work well in practice.


We derive the two-dimensional subspace method by noting that a cost
improvement and recentering direction, provided they are not
co-linear, define a two-dimensional cross section of the polytope.
Because of the reduced dimension, the cost function can be easily
minimized on this two-dimensional section. The solution to the reduced
problem then defines a search direction that combines the original two
directions. The two-dimensional subspace method is described in detail
in §3.2. As reported in §4, this method also performs well.

Each of these multi-direction methods includes derivatives of $L$ that
contain the term $t - c^T u$ and, implicitly, its initial value,
$\varepsilon$. Since no effort is being made to remain on a particular
trajectory, it is reasonable to “start over” at each step, i.e., to
pick a new value of $\varepsilon$ at each step and to ignore $t$. Thus,
in the following, the derivatives are written using $\varepsilon$ and
not $t - c^T u$. For example,

$$\nabla_u L(u, \varepsilon) = -A^T R - \frac{c}{\varepsilon},$$

where $R$ denotes the vector with components $1/r_k(u)$. Particular
choices for $\varepsilon$ are discussed below in context.


3.1.  Two-Step Methods

As outlined above, the two-step procedure follows a cost improvement
step with an independent recentering step. The dual affine direction

$$s_{aa} = -[A^T D^2 A]^{-1} c$$

is the obvious choice for the cost improvement direction. There are
many possible choices for the recentering direction.

One such choice is the Newton recentering direction defined by (2.7),
i.e.,

$$s_n = -\left[ A^T D^2 A + \frac{c c^T}{\varepsilon_n^2} \right]^{-1} \left[ A^T R + \frac{c}{\varepsilon_n} \right], \qquad (3.1)$$

where $R$ is the vector with components $1/r_k(u)$ and

$$\varepsilon_n = -\frac{c^T (A^T D^2 A)^{-1} c}{c^T (A^T D^2 A)^{-1} A^T R}.$$

The choice of $\varepsilon_n$, originally suggested by McCormick
[McC87], results in a Newton recentering direction that is orthogonal
to the direction $c$. The value of $\varepsilon_n$ thus minimizes the
norm of the vector

$$[A^T D^2 A]^{-1} \left[ A^T R + \frac{c}{\varepsilon} \right].$$

In the presence of ill-conditioning in $A^T D^2 A$, the steepest
descent recentering direction

$$s_d = -\left[ A^T R + \frac{c}{\varepsilon_d} \right],$$

where

$$\varepsilon_d = -\frac{c^T c}{c^T A^T R},$$

might be preferable to the Newton direction. The value
$\varepsilon_d$, which was originally suggested by Fiacco and McCormick
[FM68] in the context of barrier functions, minimizes the $\ell_2$ norm
of $\nabla_u L(u, \varepsilon)$. This value produces a steepest descent
recentering direction that is orthogonal to the cost direction $c$,
i.e.,

$$c^T \left( A^T R + \frac{c}{\varepsilon_d} \right) = -c^T \nabla_u L(u, \varepsilon_d) = 0.$$

In the simplest implementation of the two-step method, both component
directions are computed at the current estimate $u_i$. The first step
taken is some large percentage of the distance to the boundary in the
dual affine direction, $-(A^T D^2 A)^{-1} c$. This step results in an
intermediate point $\bar{u}_i$. The trajectory is then corrected using
the recentering direction computed at $u_i$, provided that the
recentering direction forms a negative inner product with [expression
illegible in the source], where $\varepsilon$ is either $\varepsilon_n$
or $\varepsilon_d$ depending on which recentering direction is used. A
quadratic model is used to determine the steplength for the recentering
direction.

The two-step method is improved if the recentering direction is updated
at the intermediate point $\bar{u}_i$. For the steepest descent
direction, this simply means computing $s_d$ at $\bar{u}_i$.

For the Newton recentering direction, however, only a “partial” update
of $s_n$ is made. Applying the Sherman-Morrison-Woodbury formula to
(3.1) yields

$$s_n = -\left[ (A^T D^2 A)^{-1} A^T R + \beta(\varepsilon)\, (A^T D^2 A)^{-1} c \right], \qquad (3.2)$$

which is a linear combination of the dual affine direction and a
transformed gradient term. (Note that $\beta$ is a function of
$\varepsilon$.) The partially updated Newton recentering direction is
obtained by only evaluating $R$ and $\varepsilon_n$ at $\bar{u}_i$;
these updated quantities are written $\bar{R}$ and
$\bar{\varepsilon}_n$. Thus,

$$\bar{s}_n = -\left[ (A^T D^2 A)^{-1} A^T \bar{R} + \beta(\bar{\varepsilon}_n)\, (A^T D^2 A)^{-1} c \right]. \qquad (3.3)$$

This direction is easily computed using the already factored form of
$A^T D^2 A$. Since $A^T D^2 A$ is a positive definite matrix, this
updated Newton search direction is a trajectory improving direction.


Our best results for the two-step method were obtained using the updated recentering directions, and it is this implementation that is reported in §4. Our two-dimensional subspace methods, discussed next, show even better performance.
3.2.  Two-Dimensional  Subspace  Methods


Observe that the dual affine and recentering directions determine a two-dimensional plane and that this plane intersects the polytope to form a two-dimensional cross section on which the current estimate lies. We obtain a search direction by minimizing the cost function on this cross section. Given two linearly independent directions, s_1 and s_2, the two-dimensional subproblem is thus

$$ \min_{\zeta_1, \zeta_2} \; \zeta_1 c^T s_1 + \zeta_2 c^T s_2 \quad \text{subject to} \quad \zeta_1 A s_1 + \zeta_2 A s_2 \le b - A u \qquad (3.4) $$


for scalars ζ_1 and ζ_2. The solution to this subproblem then determines weights for the search directions s_1 and s_2, respectively, that define the multidirectional search direction

$$ s = \zeta_1 s_1 + \zeta_2 s_2. $$

The solution to (3.4) produces an optimal search direction with respect to s_1 and s_2 at the current point. Specifying a steplength completes the algorithm.
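The two-variable subproblem above can be illustrated with a minimal sketch. It assumes the subproblem is bounded, so that the minimum is attained at a vertex, and it simply enumerates the intersections of pairs of constraint lines. This is not the simplex-based approach the paper reports using; the function name and arguments are hypothetical.

```python
from itertools import combinations


def solve_2d_subproblem(cs1, cs2, As1, As2, resid, tol=1e-12):
    """Minimize z1*cs1 + z2*cs2 subject to z1*As1[j] + z2*As2[j] <= resid[j].

    cs1, cs2 : the scalars c^T s1 and c^T s2
    As1, As2 : the vectors A s1 and A s2
    resid    : the residual vector b - A u at the current point
    Returns (z1, z2), or None if no feasible vertex exists.
    Assumes the optimum is attained at a vertex (bounded subproblem).
    """
    m = len(resid)
    best, best_val = None, float("inf")
    for i, j in combinations(range(m), 2):
        det = As1[i] * As2[j] - As1[j] * As2[i]
        if abs(det) < tol:
            continue  # parallel constraint lines: no unique intersection
        # Cramer's rule for the intersection of constraint lines i and j
        z1 = (resid[i] * As2[j] - resid[j] * As2[i]) / det
        z2 = (As1[i] * resid[j] - As1[j] * resid[i]) / det
        feasible = all(
            As1[k] * z1 + As2[k] * z2 <= resid[k] + 1e-9 for k in range(m)
        )
        if feasible:
            val = cs1 * z1 + cs2 * z2
            if val < best_val:
                best, best_val = (z1, z2), val
    return best
```

The returned pair plays the role of the weights ζ_1 and ζ_2. Vertex enumeration costs on the order of m^3 operations, so it is only a teaching device; a simplex solver on the dual formulation scales far better.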

The only restriction on s_1 and s_2, the generators for the subproblem, is that they be linearly independent. The dual affine direction and the Newton recentering direction produce the obvious choice for the subproblem generators, namely


S2  .4^fll„,  . 

The partially updated Newton recentering step, discussed in §3.1, could also have been used to obtain s_2. In our computational studies, however, we found that the former produced better results [BDDW88].

We also have examined the properties of a second set of generators,

$$ s_1 = (A^T D^2 A)^{-1} c, \qquad s_3 = (A^T D^2 A)^{-1} a_k, $$

where a_k is the first constraint encountered in the s_1 direction. This choice of generators is motivated as follows.
Suppose we have a search direction

$$ s_1 = (A^T D^2 A)^{-1} d_1, $$

for some d_1. (In this study, d_1 = c so s_1 is the dual affine direction.) Then let k be the index of the first constraint encountered in the s_1 direction. From the current point, u_c, take a step of, say, 99% of the distance to this constraint, obtaining a point u_+. Compute r_k(u_+). Now a rank-one update of (A^T D^2 A) due to the change in residual k can be written as

$$ A^T D^2 A + \frac{a_k a_k^T}{r_k(u_+)^2}. $$

Evaluating the new Hessian inverse using the Sherman-Morrison-Woodbury formula and the previously factored form of (A^T D^2 A) results in a second direction s_3 = (A^T D_+^2 A)^{-1} d, for any choice of d not orthogonal to a_k. This direction has a dominant component in the direction (A^T D^2 A)^{-1} a_k, and if d = d_1, the new direction is dominated by s_1 and (A^T D^2 A)^{-1} a_k for the subsequent step.

The generators to the subproblem can be varied depending on the location of the current estimate, i.e., its proximity to the optimal vertex. This is done in order to create a globally effective algorithm and to alleviate problems caused by ill-conditioning of the Hessian.

4.  Computational  Results 

4.1.  Methods  Analyzed 

In  this  section,  we  present  results  for  two  of  the  methods 
described  in  §3: 

•  a  two-step  method  comprised  of  a  dual  affine  step 
followed  by  a  recentering  step;  and 

•  a  two-dimensional  subspace  method. 

The results from a dual affine approach are used as the baseline for comparing these more promising methods. It has been shown in [MM87] and [MMS88] that the dual affine method compares favorably to MINOS 5.0 [MS83], a well known and widely available implementation of the simplex method. Since our dual affine implementation reproduces the dual affine results reported in [MM87] and [MMS88], it is assumed that our work would also compare favorably with the MINOS simplex code.

The two-step method is implemented using a dual affine step followed by a steepest descent recentering step in the early iterations, and a dual affine step followed by a partially updated Newton recentering step in the final iterations. The switch from one recentering direction to the other is based on ε, the residual to the objective row. The two-step approach is used in both Phase 1 and Phase 2.

The two-dimensional subspace method also uses the residual to the objective row, ε, to switch between strategies. While ε > 1, the two-dimensional subproblem generators are

$$ s_1 = s_{da}, \qquad s_3 = (A^T D^2 A)^{-1} a_k. $$

These generators work well in the early iterations when the number of active constraints is small. Once ε ≤ 1, however, we switch the second generator to (A^T D^2 A)^{-1} A^T R^{-1} e. In both cases, the two-dimensional subproblems are solved exactly using the simplex method (see §4.2).

The two-dimensional subspace methods were originally configured in two ways. The first used the solution to the two-dimensional subproblems in both Phase 1 and Phase 2. The second used a dual affine approach in Phase 1 and did not use the two-dimensional subproblem solution until Phase 2. The former did not perform as well as the latter and therefore only the results of the latter configuration are reported here. Since the initial feasible solution can have a significant effect on a method’s overall performance, we are now investigating Phase 1 procedures other than those described below. (See, e.g., [Bar88].)

4.2.  Implementation  Details 

Starting Values and Initial Feasible Points. For each of the problems analyzed in this study, the initial solution is u_0 = 0. A big-M Phase 1 procedure (see, e.g., [BJ77]) is used to obtain an initial feasible solution when necessary. This is implemented by adding an artificial variable with coefficient 1 to every row in A. The
Phase  1  problem  is  then  solved  with  an  artificial  variable 
with  coefficient  Af  ^  10*  added  to  the  original  objective 
row.  The  Phase  2  problem  begins  once  the  value  of  the 
artificial  variable  becomes  negative  and  can  therefore  be 
removed. 

Scaling. In the implementations reported in this paper, the A matrix is not scaled. The two-dimensional subproblem constraint matrix defined by (3.4) has been scaled, however, to improve the numerical stability of the subproblem solution. The two columns of this matrix are constructed using the normalized search directions, s_1/||s_1|| and s_2/||s_2||, respectively (see §3.2). Each row of the subproblem constraint matrix is then scaled to have norm 1.

Constraint Dropping. Constraints that are sufficiently far from the current point u, i.e., those having residuals r_j(u) that satisfy

rj(ii)  >  10'^  X  min{rfc(u).t  l,...,m},  (I-S) 

are explicitly removed from the computations. Constraints j that satisfy (4.5) are “dropped” by setting r_j^{-1} and D_jj to zero prior to computing A^T D^2 A and A^T R^{-1}. This improves the sparsity in A^T D^2 A and the numerical accuracy of the resulting search directions, and therefore leads to improved performance.
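The dropping rule can be sketched as follows. This is an illustration only: the threshold multiplier is left as a caller-supplied parameter `factor` (hypothetical name), and the sketch assumes that the diagonal scaling entry for constraint j is 1/r_j, which is then zeroed for dropped constraints.

```python
def scaling_with_dropping(residuals, factor):
    """Return (kept_indices, diagonal_entries) after dropping far constraints.

    A constraint j is dropped when residuals[j] > factor * min(residuals).
    Kept constraints get the entry 1/r_j (an assumed form of the scaling);
    dropped constraints get 0.0 so they vanish from later products.
    """
    r_min = min(residuals)
    kept = [j for j, r in enumerate(residuals) if r <= factor * r_min]
    diag = [1.0 / r if r <= factor * r_min else 0.0 for r in residuals]
    return kept, diag
```

Because the entries for dropped constraints are exactly zero, the corresponding rows contribute nothing to the scaled normal-equations matrix, which is where the sparsity gain described above comes from.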

Steplength Selection. As discussed in §3, the steplength for the dual affine method is generally specified as a large percentage of the distance to the boundary of the polytope. In Table 1, two sets of dual affine results are listed, differing only in the percentage values used. The first set is implemented using the same steplength configuration as that reported in [MM87], i.e., the steplength is 99% of the distance to the boundary of the polytope for the first 10 iterations, and 90% of the distance thereafter. The steplength for the second set of dual affine results is 99% of the distance to the boundary of the polytope for all iterations.
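A minimal sketch of this steplength rule, assuming constraints of the form A u ≤ b with positive residuals r = b - A u at the current interior point. The function names are hypothetical, and counting iterations from 1 is an assumption.

```python
def boundary_fraction(iteration, early=0.99, late=0.90, switch_after=10):
    """99/90 schedule: `early` for the first `switch_after` iterations
    (counted from 1), `late` afterwards. Use late=0.99 for 99/99."""
    return early if iteration <= switch_after else late


def step_to_boundary(resid, As, fraction):
    """Steplength equal to `fraction` of the distance to the boundary.

    resid : residuals r = b - A u (all positive at an interior point)
    As    : the vector A s for the search direction s
    Returns None when no constraint blocks the direction.
    """
    # Only constraints whose residual shrinks along s can become active.
    ratios = [r / a for r, a in zip(resid, As) if a > 0.0]
    if not ratios:
        return None
    return fraction * min(ratios)
```

For example, with residuals (2, 1, 3) and A s = (1, -1, 6), the third constraint is reached first at distance 0.5, so a 99% step has length 0.495.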


The steplengths for the two-step method are specified independently for each of the two search directions. In the dual affine direction, the steplength is 99% of the distance to the boundary of the polytope. In the recentering direction, the steplength is selected using a standard quadratic line search.

For the two-dimensional subspace procedures, the search direction is determined by the subproblem solution. This solution provides multipliers that take the current estimate to an exterior face of the polytope. The steplength is 99% of the distance to that face.

Solving the Two-Dimensional Subproblem. The two-dimensional subproblems are solved exactly using the general purpose simplex method implemented in IMSL routine ZX4LP [IMS84] and a dual formulation of the subproblem. Empirically, the number of pivots required for each subproblem was found to be less than 15 in most cases.

Stopping  Criteria.  Three  convergence  tests  are  used  to 
terminate  the  iterations.  Objective  function  convergence  is 
obtained  when 

■  <  10  * 

z, 

where  z,  is  the  objective  function  value  at  iteration  i. 
The  convergence  criterion  based  on  the  relative  difference 
between  the  primal  and  dual  objective  values  is  of  the  form 

^  <  'O'  * 
max|  z,  ) 

where z_i^* is the dual objective function value at the current iteration. This, of course, can only be tested when the dual multiplier estimate D^2 A (A^T D^2 A)^{-1} c is nonnegative (see [MM87]). The third convergence criterion is based on steplength, where convergence is observed when the steplength

As  <  H)  ’*. 

Computing Environment. The methods reported here were implemented in Fortran and executed in double precision on the Cyber 205 at the National Bureau of Standards central computing facility. The A matrix is encoded in sparse format using the XMP experimental mathematical programming data structures described in [Mar81], and the Hessian is encoded and solved using the Yale sparse matrix package SMPAK [SMP85], with non-positive definiteness of the Hessian handled in the standard way by augmenting the diagonal entries.

4.3.  Test  Set  Description 

The methods analyzed in this study were tested on 31 of the 54 publicly available linear programming problems available on Netlib through [Gay85]. The problems omitted from our study are those with implicit bounds, which our implementations do not currently handle. All but 3 of the 31 problems analyzed required Phase 1 to obtain an initial feasible point given u_0 = 0. Another 8 problems do not have a full dimensional interior and therefore only required Phase 1 to find the optimal solution. The remaining 20 problems required both Phase 1 and Phase 2.

4.4.  Observations 

Convergence. Our results agree well with the accepted optimal values provided in [Gay85]. With few exceptions, each of our implementations solves the problems in our test set “correctly”, converging to the accepted value with at least 7 digits of agreement [BDDW88]. The most noteworthy of the exceptions is problem CzProb. None of the methods reported here, and in fact none of the methods we tested, converge with more than 3 digits of agreement for CzProb, although all of our methods do converge to exactly the same value, namely z_CzProb = 2182528.5.

Excepting CzProb, both variants of the dual affine method agreed with the accepted values for all of the remaining problems. The two-step method, however, failed to agree for one other problem, E226 (relative error 6e-6), while the two-dimensional subspace method failed to agree with the accepted value for Ship12l (relative error = .3e-5).

We  are  currently  investigating  methods  for  determin¬ 
ing  the  optimal  basis  from  interior  point  solutions  such  as 
these.  One  option  is  that  of  computing  the  Lagrange  multi¬ 
pliers  and  checking  for  dual  feasibility  at  suspected  optimal 
solutions.  “Restarting"  the  iterations  when  the  Lagrange 
multipliers  indicate  a  non-optimal  solution  has  been  found 
should  eliminate  the  problems  noted  here. 

Iteration Counts. Each of the methods reported in this paper has the same order of work per iteration. Iteration counts rather than execution times are reported, which has the advantage of making these results comparable over different machines.

Our results show that the dual affine method using the 99/99 steplength configuration results in an overall reduction in the number of iterations for the problems in our test set when compared to the dual affine implementation using the 99/90 steplength configuration used by [MM87]. While this reduction is generally only an iteration or two, the iteration count for CzProb is decreased by 6 iterations, a relative change of 12%. There are also only two instances where the total number of iterations increased using the 99/99 steplength variant, in one case by 1 iteration and in the other by 2. These results thus indicate that the 99/99 configuration is preferable over the 99/90 configuration. Note that our 99/99 dual affine results also compare favorably with those reported in [MMS88]. We thus use the 99/99 steplength configuration of the dual affine method as our baseline for comparing the two-step and two-dimensional subspace methods.


Our  results  show  that  the  two-step  method  results  in  a 
decrease  in  the  number  of  iterations  almost  3  times  more 
often  than  it  results  in  an  increase  when  compared  to  the 
dual  affine  approach  with  a  99/99  steplength  configuration; 

•  16  of  the  problems  show  a  decrease  in  the  number  of 
iterations, 

•  6  of  the  problems  show  an  increase,  and 

•  9  of  the  problems  show  no  change. 

The maximum relative decrease in the iteration count is 25%, the maximum relative increase is 46%, and, on the average, the relative number of iterations decreases by 2%. There is no obvious difference between the results for the first half and those for the second half of the problem set, indicating that the method performs equally well on both the smaller and larger problems.

The results for the two-dimensional subspace method are significantly better than those for both the dual affine and two-step methods. Using this method, the number of iterations decreased 10 times more often than it increased:

•  20  of  the  problems  show  a  decrease  in  the  number  of 
iterations, 

•  2  of  the  problems  show  an  increase,  and 

•  9  of  the  problems  show  no  change,  of  which  8 
are  Phase  1  problems  and  therefore  cannot  show  a 
change.  (See  §4.1.) 

The maximum relative decrease in the iteration count is 41%, the maximum relative increase is 11%, and, on the average, the relative number of iterations decreases by 16% (12% counting the 8 Phase 1 problems in the total number of problems). Again, there is no obvious difference between the results for the first half and those for the second half of the problem set.

4.5.  Conclusions 

The results of this study demonstrate the computational advantages of using recentering ideas and more sophisticated adaptations to the traditional method of centers. In particular, our two-dimensional subspace procedure produces results that are a significant improvement over the dual affine method, reducing the number of iterations by an average of 16%. The two-dimensional subspace results are also competitive with the dual affine results reported in Monma and Morton [MM87] and with the primal-dual interior point results reported in McShane et al. [MMS88]. The procedures presented do not increase the order of the work required per iteration, and can be implemented easily.
Table 1: Iteration Counts (Phase 1 / Total)

Name        Dual Affine    Dual Affine    2-Step     2-D Subspace
            (99/90)        (99/99)
Afiro       1 / 21         1 / 20         1 / 20     1 / 13
ADlittle    1 / 22         1 / 21         1 / 21     1 / 21
Scagr7      3 / 24         3 / 23         3 / 23     3 / 21
Sc205       4 / 28         4 / 26
AN APPLICATION OF QUASI-NEWTON METHODS TO PARAMETRIC EMPIRICAL BAYES ESTIMATION
David  Scott,  Universite  de  Montreal 


Introduction 

This article discusses some numerical methods connected with variance estimation in an empirical Bayes setting. Our principal focus is the iterative EM process of Dempster, Laird, and Rubin (1977) as applied to parametric empirical Bayes problems in which the prior distribution belongs to a regular exponential family of distributions with unknown variance (and possibly other unknown hyperparameters). Because the EM process can converge slowly and because each iteration involves the calculation of a posterior expectation, numerical methods designed to contain the computational burden in this method are of interest. A method which assumes normality of the posterior distribution in order to simplify EM calculations has been proposed by Laird (1978) based on a suggestion by Leonard (1975). The main contribution of our research is the use of a quasi-Newton approximation to the observed information matrix in cases where the Leonard-Laird approximation is used in the EM iterations. We show that in practice the quasi-Newton methods give roughly the same degree of accuracy in variance estimation as Newton-Raphson methods, and can allow considerable savings in computation time in problems where a large number of parameters must be estimated.

Empirical Bayes and the EM process

One of the main contributions of the work on maximum likelihood with missing observations by Dempster, Laird, and Rubin (1977) was their demonstration that the unknown parameters in an empirical Bayes problem could be treated as "missing data" in an overall statistical model and then estimated by applying a general model for ML estimation from incomplete data. Nominally, one estimates the hyperparameters of the overall model; the empirical Bayes parameter estimates fall out as by-products of the hyperparameter estimation. This process, to which Dempster et al. gave the name "EM" to emphasize its iterative use of an expectation of missing values (the E step) to carry out maximum likelihood (the M step), had been used before on many occasions. Dempster et al. showed the generality and usefulness of the EM procedure in situations, like parametric empirical Bayes, in which the connection with the missing data problem had not previously been apparent.

The EM process becomes particularly interesting if the complete statistical model of the parameters and the observations is a member of a regular exponential family, because it then becomes necessary to calculate only the posterior expectation of the sufficient statistic for the hyperparameter, rather than all of the missing data, during the E-step at each iteration. In the corresponding M-step this expected sufficient statistic is used to calculate a new maximum likelihood estimate of the hyperparameter.

The general form of the EM calculations for data sampled from an exponential family is as follows. Let φ indicate a vector of hyperparameters, λ a vector of parameters, and x a vector of observed values. Given φ, the density of the parameters, p(λ | φ), is assumed to belong to an exponential family. The density of the observations given λ, f(x | λ), will in general depend on λ but not φ. The joint model of x and λ, ignoring a constant of proportionality, is

$$ r(x, \lambda \mid \phi) \propto f(x \mid \lambda)\, p(\lambda \mid \phi). \qquad (1) $$

If we assume an initial estimate φ_0 of φ, then the p-th EM iteration is:

E-step: Given an estimate φ_p, calculate the posterior expectation of the sufficient statistic t for φ:

$$ t_p = E(t \mid x, \phi_p). \qquad (2) $$

This calculation will in general also yield an estimate λ_p of the parameters.

M-step: Given t_p, calculate a new estimate φ_{p+1} of φ using maximum likelihood.
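The alternation of E- and M-steps can be sketched as a short loop. The toy model used to exercise it (Normal parameters observed with unit-variance Normal noise) is an assumption chosen because its E-step has a closed form; it is not the paper's application, and all names are hypothetical.

```python
def em_iterate(phi0, e_step, m_step, tol=1e-10, max_iter=1000):
    """Alternate E- and M-steps until the hyperparameter estimate settles."""
    phi = phi0
    for _ in range(max_iter):
        t = e_step(phi)       # E-step: expected sufficient statistic given phi
        phi_new = m_step(t)   # M-step: maximum likelihood estimate given t
        if abs(phi_new - phi) <= tol * max(1.0, abs(phi)):
            return phi_new
        phi = phi_new
    return phi


# Toy model (illustration only): lambda_k ~ Normal(0, sigma2) and
# x_k | lambda_k ~ Normal(lambda_k, 1), with sigma2 the hyperparameter.
x = [2.0, -2.0, 2.0, -2.0]


def e_step(sigma2):
    # Posterior of lambda_k is Normal(shrink * x_k, shrink).
    shrink = sigma2 / (sigma2 + 1.0)
    return sum((shrink * xk) ** 2 + shrink for xk in x)


def m_step(t):
    # MLE of sigma2 given the expected sum of squares t.
    return t / len(x)


sigma2_hat = em_iterate(1.0, e_step, m_step)
```

In this toy case the nonzero fixed point of the iteration is the mean of the squared observations minus the noise variance, which gives a simple check on the loop.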

It generally seems to be the case in EM-type calculations for exponential families that once an E-step has been performed the corresponding M-step is straightforward. The E-step, consisting as it does of a posterior expectation, poses more important numerical problems. The obvious approach is to use numerical integration, which for high-dimensional problems can be very time consuming. Laird (1978) proposed to solve this problem by using an approximation which apparently originated with Leonard (1975) and which has been used in many recent studies using parametric empirical Bayes (e.g., Wong and Mason, 1985; Tomberlin, 1988). We first note that the posterior distribution of λ given x and φ is proportional to (1). Our first assumption is that the posterior mean of λ is the posterior mode. This mode can be found by optimizing (1) with respect to λ. Second, we assume that the observed information matrix:


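$$ \left[ -\frac{\partial^2}{\partial\lambda\,\partial\lambda^T} \log r(x, \lambda \mid \phi) \right]^{-1} \Bigg|_{\lambda_p,\,\phi_p} \qquad (3) $$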

accurately represents the posterior covariance matrix of λ. If one of the components of φ is a variance component for λ, then it will often happen that the sufficient statistic for this variance component involves the observed information matrix. Laird (1978) notes that the two conditions of (i) symmetry and (ii) posterior covariance matrix equal to (3) are satisfied if we assume that (1) is proportional to a Normal density.

If we use the Newton-Raphson method for unconstrained optimization as a means for carrying out the calculations in the E-step (2), we have available, at the optimum λ_p for the p-th iteration, the matrix H_p of second derivatives at λ = λ_p. The inverse of this matrix is the negative of the observed information matrix (3).

Our investigation of numerical methods in parametric empirical Bayes concerns the implementation of this approximation. Before proceeding to a discussion of numerical methods, we illustrate the EM calculations using the example of empirical Bayes estimation of the scale parameters in the Bradley-Terry paired comparison model.


An  example 


In this section we present an example of parametric empirical Bayes estimation in a classical statistical paradigm, the method of paired comparisons (Bradley and Terry, 1952). Consider an experiment involving comparisons between a set of K objects by a set of N experimental subjects. Objects i and j, for i, j = 1, ..., K, are compared n_ij times; the n_ij need not be equal to N and in fact need not be equal to each other. The data consist of a matrix of counts X = (x_ij), for i, j = 1, ..., K, in which x_ij represents the number of times object i is preferred to object j in the n_ij times which these two objects are compared. We assume that there are no ties, hence that x_ij + x_ji = n_ij for all i, j, and that x_ii = 0 for all i.

Given n_ij, we assume that x_ij is distributed according to a Binomial distribution,

$$ x_{ij} \sim \mathrm{Binomial}(n_{ij}, \pi_{ij}), $$

where π_ij is the probability that object i will be preferred to object j in a single comparison. Following Bradley and Terry (1952), we propose the following model for the π_ij:

$$ \pi_{ij} = \frac{\exp(\theta_i)}{\exp(\theta_i) + \exp(\theta_j)}, $$

where the θ_k, k = 1, ..., K are parameters to be estimated. Since π_ij is monotone increasing in (θ_i - θ_j), this model unambiguously defines a scale on which we can rank the K objects being compared, i.e., object i is "ranked higher" than object j if and only if θ_i > θ_j.
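The paired-comparison model can be made concrete with a small sketch. It assumes the usual logistic form of the Bradley-Terry model, in which the preference probability depends on the two scale parameters only through their difference; the function names and the layout of the count matrix are hypothetical.

```python
import math


def preference_probability(theta_i, theta_j):
    """P(object i is preferred to object j), assumed logistic in the difference."""
    return 1.0 / (1.0 + math.exp(-(theta_i - theta_j)))


def log_likelihood(theta, wins):
    """Binomial log likelihood, omitting the binomial coefficients.

    theta : list of K scale parameters
    wins  : K x K matrix, wins[i][j] = times object i was preferred to j
    """
    total = 0.0
    k = len(theta)
    for i in range(k):
        for j in range(i + 1, k):
            p_ij = preference_probability(theta[i], theta[j])
            total += wins[i][j] * math.log(p_ij)
            total += wins[j][i] * math.log(1.0 - p_ij)
    return total
```

Because only differences enter, adding the same constant to every scale parameter leaves every probability unchanged, so some constraint on the parameters is needed before they can be estimated.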

The θ_k so defined are only unique up to a change in location, in that a constant can be added to all the θ_k without affecting the value of any of the π_ij. A constraint must therefore be placed on the θ_k for them to be estimable. We impose this constraint in the following way: we arbitrarily choose one of the θ_k, which without loss of generality we call θ_K, and then we reparameterize the problem in terms of the K-1 parameters

$$ \lambda_k = \theta_k - \theta_K $$

for k = 1, ..., K-1. This reparameterization is equivalent to fixing the origin of the scale defined in (1) at an arbitrarily chosen parameter value. Note that we can still rank the K objects using the λ_k instead of the θ_k, by saying that i is "ranked higher" than j if λ_i > λ_j for both i, j ≠ K, and i ≠ K is "ranked higher" ("ranked lower") than K if λ_i > 0 (λ_i < 0).

Our empirical Bayes approach to estimating the λ_k is inspired by the approach to estimation in log-linear models given by Laird (1978). We consider that the λ_k are (iid) Normal(0, σ²), where σ² is a variance hyperparameter which must be estimated from the data. Thus the hyperparameter φ consists of the single component σ². We estimate σ² through the EM process, using the Leonard-Laird approximation to the posterior distribution of λ.

We first establish that the overall statistical model for the "complete data" (x, λ) belongs to an exponential family. This is not difficult since, as we see presently, the joint density of the observations x given the parameters does not involve the hyperparameter, and the joint density of λ belongs to an exponential family. Thus the overall statistical model of (x, λ) must belong to an exponential family.

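Under our assumptions, the likelihood of the observations is

$$ f(x \mid \lambda, \sigma^2) = \prod_{i=1}^{K} \prod_{j=i+1}^{K} \binom{n_{ij}}{x_{ij}} \pi_{ij}^{x_{ij}} \pi_{ji}^{x_{ji}}, $$

while the joint density of the parameters is

$$ p(\lambda \mid \sigma^2) = (2\pi\sigma^2)^{-(K-1)/2} \exp\!\left( -\frac{\lambda^T \lambda}{2\sigma^2} \right). \qquad (7) $$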

The expression for the model of the "complete data" can then be written, after taking logarithms and performing some algebraic manipulations,

$$ \log r(x, \lambda \mid \sigma^2) = -\frac{K-1}{2} \log \sigma^2 - \frac{\lambda^T \lambda}{2\sigma^2} + d(x, \lambda), \qquad (8) $$

where d(x, λ) does not depend on the hyperparameter σ². Note that the two explicit terms in (8) originate in the prior density (7). Also note that (8) is in a form where the sufficient statistic for σ² is readily apparent: s² = λ^T λ.

The EM calculations for this application become the following. Let σ_0² denote an initial estimate of σ². The p-th iteration of the EM procedure is then:

E-step: Assume that an estimate σ_p² of σ² is at hand. Calculate

$$ s_p^2 = E(s^2 \mid x, \sigma_p^2) = E(\lambda^T \lambda \mid x, \sigma_p^2), \qquad (9) $$

which is the posterior expectation of s², the sufficient statistic for σ².

M-step: Calculate the maximum likelihood estimate σ_{p+1}² given that s² = s_p². This estimate is σ_{p+1}² = s_p² / (K-1).

A demonstration that this process must ultimately converge may be found in Dempster et al. (1977).

The Leonard-Laird approximation allows an important simplification in the E-step calculations (9). Since we assume that -H_p^{-1} is the posterior covariance matrix of λ, and that the posterior mean of λ is equal to its posterior mode λ_p, it is easy to show that

$$ E(\lambda^T \lambda \mid x, \sigma_p^2) = \lambda_p^T \lambda_p + \operatorname{tr}(-H_p^{-1}). \qquad (10) $$

This expression only involves quantities which are readily available at the termination of a straightforward application of the Newton-Raphson method to find λ_p.
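The simplified E-step and the matching M-step can be sketched in a few lines. The sketch rests on the standard identity that the expected squared length of a random vector equals the squared length of its mean plus the trace of its covariance matrix; the function names are hypothetical, and the covariance matrix is passed in directly rather than derived from a Hessian.

```python
def expected_sum_of_squares(mode, covariance):
    """E-step under the normal approximation: mode^T mode + trace(covariance).

    mode       : posterior mode, used as the posterior mean
    covariance : approximate posterior covariance matrix (list of rows)
    """
    trace = sum(covariance[i][i] for i in range(len(covariance)))
    return sum(v * v for v in mode) + trace


def variance_update(s2, n_params):
    """M-step: maximum likelihood variance given the expected sum of squares."""
    return s2 / n_params
```

Only the diagonal of the covariance matrix is used, which is why the trace alone, rather than the full inverse, is needed at each iteration.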

In  the  remainder  of  this  article  we  discuss 
alternative  methods  for  performing  the  calcula¬ 
tions  in  (10).  Our  particular  interest  centers 
on  what  changes  occur  in  the  quantity  (10)  if  we 
use  a  quasi-Newton  approximation  to  Hp  instead 
of  calculating  second  derivatives.  This  inter¬ 
est  stems  from  the  potential  for  considerable 
savings  in  computation  time  if  this  quasi-Newton 
approximation  is  applied. 

Numerical  implementation  of  Newton-type  methods 
in  the  EM  algorithm 


In the application of the EM process to variance estimation in empirical Bayes problems such as the one we describe, the inverse of the matrix of second derivatives of the complete data density (the "observed information matrix") is used in constructing successive estimates of the variance hyperparameter. This matrix is naturally available if one uses the Newton-Raphson procedure to carry out unconstrained optimization when estimating λ_p. Let λ_p^(0), λ_p^(1), λ_p^(2), ... be the approximations to λ_p which are generated during the Newton-Raphson procedure applied to the computations (10). The n-th iteration of this procedure can be written, for n = 0, 1, 2, ...,

$$ H^{(n)} s^{(n)} = -g^{(n)} \qquad (11a) $$

$$ \lambda^{(n+1)} = \lambda^{(n)} + s^{(n)} \qquad (11b) $$

where s^(n) is a search direction, g^(n) is the vector of first derivatives (with respect to λ) of log r(x, λ | σ²) evaluated at an estimate λ^(n) of λ_p, and H^(n) is the matrix of second derivatives (the Hessian matrix) evaluated at the same point. At convergence of the Newton-Raphson procedure, λ_p is taken to be the final iterate in the sequence generated by (11) and the observed information matrix H_p^{-1} is the inverse of the Hessian evaluated at λ_p.

From (10), however, we see that at each
E-step in the EM process we do not need the
entire observed information matrix but only its
trace. We can thus use an elementary result in
numerical linear algebra (e.g., Golub and Van
Loan, 1983) to avoid having to explicitly
calculate the Hessian inverse at all. Any positive
definite matrix M can be decomposed as $M = AA^T$
where A is nonsingular and lower triangular
(here A denotes a generic triangular factor, not
the parameter vector). Then $M^{-1} = C^T C$ where
$C = A^{-1}$, and $\mathrm{tr}\, M^{-1} = \|C\|_F^2$ where
$\|\cdot\|_F$ is the Frobenius norm. This result indicates
that the quantity of primary interest for
numerical calculation of a variance estimate in the
application we are describing is the Cholesky
factor of $H_p$, that is, the lower-triangular
matrix $L_p$ such that $L_p L_p^T = H_p$. In addition,
since this factor is triangular, its inverse is
also triangular and this fact can be incorporated
into very fast algorithms for calculating
the squared Frobenius norm of $L_p^{-1}$, which is
nothing more than the sum of its squared elements.
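The trace identity can be checked numerically. The following is a minimal sketch in pure Python, not the FORTRAN code used in this research; the function names (`cholesky_lower`, `invert_lower`, `trace_of_inverse`) are illustrative assumptions. It factors a positive definite matrix, inverts the triangular factor by forward substitution, and sums the squared elements of that inverse.

```python
import math


def cholesky_lower(m):
    """Return lower-triangular L with L L^T = m (m symmetric positive definite)."""
    n = len(m)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            acc = m[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(acc)
            else:
                low[i][j] = acc / low[j][j]
    return low


def invert_lower(low):
    """Invert a lower-triangular matrix by forward substitution, column by column."""
    n = len(low)
    inv = [[0.0] * n for _ in range(n)]
    for col in range(n):
        for i in range(col, n):
            rhs = 1.0 if i == col else 0.0
            acc = rhs - sum(low[i][k] * inv[k][col] for k in range(col, i))
            inv[i][col] = acc / low[i][i]
    return inv


def trace_of_inverse(m):
    """tr(m^{-1}) as the sum of squared elements of the inverse Cholesky factor."""
    c = invert_lower(cholesky_lower(m))
    return sum(c[i][j] ** 2 for i in range(len(c)) for j in range(i + 1))
```

For the 2 x 2 matrix with rows (4, 2) and (2, 3), whose inverse has diagonal 3/8 and 4/8, the function returns 0.875 up to rounding, without ever forming the full inverse.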

In fact, modern implementations of the
Newton-Raphson procedure do compute the Cholesky
factor of the Hessian at each iteration, rather
than carry out the calculations exactly as given
in (11). A fast, straightforward implementation
of the Newton-Raphson procedure can be coded in
FORTRAN, for example, using the subroutines for
positive definite matrices available in LINPACK
(Dongarra, Bunch, Moler, and Stewart, 1979). For
our problem, the Cholesky decomposition as coded
in LINPACK uses on the order of $(1/6)(K-1)^3$
arithmetic operations, in contrast to the
approximately $(K-1)^3$ operations which are
necessary to explicitly form the inverse of any of
the $H^{(n)}$.

A  second  approach  to  the  calculation  of  Lp 
directly  approximates  Lp  using  quasi-Newton 
methods.  These  methods,  also  called  "variable 
metric"  methods,  have  a  long  history  of  applica¬ 
tion  to  the  solution  of  systems  of  nonlinear 
equations  and  unconstrained  and  constrained 
optimization.  Helpful  background  on  quasi-Newton 
methods  may  be  found  in  the  review  article  by 
Dennis and Moré (1977) and in the text by Dennis
and  Schnabel  (1983). 

The basic idea of quasi-Newton methods
applied to the nonlinear optimization in each E
step is as follows. Given the initial estimate
$H_p^{(0)}$ of $H_p$, we form successive iterates
$H_p^{(1)}$, $H_p^{(2)}$, ... of $H_p$,

$$H_p^{(n)} = H_p^{(n-1)} + U^{(n)} \qquad (12)$$

where $U^{(n)}$ is a matrix of rank two, generally
some function of $A^{(n)}$, $A^{(n-1)}$, $g^{(n)}$ and
$g^{(n-1)}$. Thus a quasi-Newton method does not
calculate second derivatives but uses only
first-order information. The matrix $H_p^{(n)}$ is then
used in place of $H^{(n)}$ in the iteration (11) to
calculate a new estimate of $A_p$.

A large part of the acceptance of quasi-Newton
methods in optimization stems from the
ability to directly compute the Cholesky factor
$L_p^{(n)}$ of $H_p^{(n)}$ from the Cholesky factor
$L_p^{(n-1)}$ of $H_p^{(n-1)}$. Several authors have
proposed efficient and numerically stable methods
for doing so (see, for example, Gill, Golub,
Murray, and Saunders, 1974; Goldfarb, 1976).
Importantly, if we have M parameters to estimate,
these methods can carry out the update in $O(M^2)$
operations, as opposed to the $O(M^3)$ operations
required to explicitly form $H_p^{(n)}$ and decompose
it. Thus if M is large, quasi-Newton
approximations hold a certain promise for carrying
out EM calculations with reduced computational
effort.
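To make the $O(M^2)$ claim concrete, here is a minimal sketch of a textbook rank-one Cholesky update in pure Python. It is not the algorithm of Gill, Golub, Murray, and Saunders (1974) and not the code used in this research; the function name `chol_rank_one_update` is an assumption. Given the factor L of a positive definite matrix and a vector x, it returns the factor of $LL^T + xx^T$ using two nested loops, hence on the order of $M^2$ operations.

```python
import math


def chol_rank_one_update(low, x):
    """Return the lower Cholesky factor of L L^T + x x^T in O(M^2) operations.

    `low` is a lower-triangular factor with positive diagonal; neither input
    is modified.
    """
    n = len(x)
    low = [row[:] for row in low]  # work on copies
    x = list(x)
    for k in range(n):
        r = math.hypot(low[k][k], x[k])   # new diagonal entry
        c = r / low[k][k]
        s = x[k] / low[k][k]
        low[k][k] = r
        for i in range(k + 1, n):
            low[i][k] = (low[i][k] + s * x[i]) / c
            x[i] = c * x[i] - s * low[i][k]  # uses the updated low[i][k]
    return low
```

A rank-two quasi-Newton update can be viewed as two rank-one modifications, one of which subtracts a term; the sketch covers only the additive case, and the subtractive case needs more care to preserve positive definiteness.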

Two quasi-Newton updates in particular have
attracted the attention of researchers. Each
has the very useful property that if $H_p^{(n-1)}$ is
positive definite, then $H_p^{(n)}$ is as well. These
updates are named with the initials of the
researchers who initially studied them. The DFP
update, named after Davidon, Fletcher, and
Powell, is perhaps the best known. The rank-two
update (12) characterizing the DFP method is:

$$U_{DFP} = \frac{1}{s^T y}\left(q y^T + y q^T\right) - \frac{q^T s}{(s^T y)^2}\, y y^T$$

where $q = y - H_p^{(n-1)} s$, $y = g_p^{(n)} - g_p^{(n-1)}$,
and $s = A^{(n)} - A^{(n-1)}$. The BFGS update, named
for its discoverers Broyden, Fletcher, Goldfarb,
and Shanno, is

$$U_{BFGS} = \frac{y y^T}{s^T y} - \frac{p p^T}{s^T p} \qquad (13)$$

where s and y are as defined above and
$p = H_p^{(n-1)} s$.
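As an illustration, the BFGS update can be written in a few lines. This is a minimal sketch in pure Python, assuming the standard form of the update, $H + yy^T/(s^Ty) - pp^T/(s^Tp)$ with $p = Hs$; the function name `bfgs_update` is an assumption, and the code works on the full matrix rather than on a Cholesky or $LDL^T$ factor. Positive definiteness is preserved only when $s^T y > 0$, so the sketch refuses to update otherwise.

```python
def bfgs_update(h, s, y):
    """Return H + y y^T / (s^T y) - p p^T / (s^T p), with p = H s.

    h is the current symmetric positive definite Hessian approximation,
    s the step in the parameters, y the change in the gradient.
    """
    n = len(s)
    p = [sum(h[i][j] * s[j] for j in range(n)) for i in range(n)]
    sy = sum(s[i] * y[i] for i in range(n))
    sp = sum(s[i] * p[i] for i in range(n))
    if sy <= 0.0:
        raise ValueError("s^T y must be positive to keep H positive definite")
    return [
        [h[i][j] + y[i] * y[j] / sy - p[i] * p[j] / sp for j in range(n)]
        for i in range(n)
    ]
```

The updated matrix satisfies the secant equation $H^{(n)} s = y$, which is the sense in which first-order information alone builds up second-order information.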

The DFP update was originally devised to
yield a good approximation to the analytic
Hessian. See Dennis and Moré (1977) for a
discussion of the nature of this approximation.
The BFGS update is "complementary" to the DFP in
the sense that it provides the same sort of
approximation to the inverse of the analytic
Hessian that the DFP provides to the Hessian
itself. Since the calculation (10) involves the
negative of a Hessian inverse, we are naturally
interested in the BFGS update. In addition, the
current consensus seems to be that the BFGS is
the most successful quasi-Newton update in
practice (see, for example, Dennis and Schnabel,
1983, and the references cited therein).

In  this  research  we  have  investigated  a 
quasi-Newton  optimization  method  using  a  BFGS 
update  as  an  alternative  to  the  Newton-Raphson 
procedure  in  the  computations  leading  to  (10). 
In the next two sections we discuss some issues
arising  from  the  implementation  of  this  method, 
and  give  some  numerical  results. 


Implementation  of  quasi-Newton  techniques 


The  Newton-Raphson  method  is  well  known  to 
practitioners since the method adapts itself
readily  to  a  wide  variety  of  problem  situations. 
Quasi-Newton  techniques  are  not  as  well  known 
and  need  more  careful  implementation  if  they  are 
to  be  useful.  In  this  section  we  discuss  certain 
practical  problems  which  arise  from  the  use  of 
quasi-Newton  methods,  many  of  which  would  not  be 
present in an analogous application of Newton-
Raphson.

The most important issue is that of finding a
good initial approximation to $H_p$. If the initial
estimate of $A_p$ is zero, the use of $H_p^{(0)} = I$, the
identity matrix, is often a bad choice. In many
statistical applications, including the one we
have described in this article, there is often
an a priori reasonability to using the origin as
the initial estimate $A^{(0)}$ of the parameter
vector, but the likelihood function is often
very poorly behaved at the origin and the
gradient evaluated there may be large. If this is the
case, then by (11) the next iterate $A^{(1)}$ may be
a very poor estimate of $A_p$. In addition, the
quasi-Newton iterations build up approximate
second-order information based on calculated
first-order information. Thus if the initial
estimates of $A_p$ are too far off the mark then much
of the early quasi-Newton updating will be
counterproductive.

Determining an initial Hessian approximation
is known as "scaling" the optimization problem
since experience has shown much better behavior
of quasi-Newton techniques if the initial
iterate $A^{(1)}$ is of roughly the same magnitude as
$A_p$. This scaling problem does not come up in the
Newton-Raphson method because the iterate $A_p^{(1)}$
is a function of the calculated Hessian at
$A_p^{(0)}$. In relatively well-behaved problems, such
as those arising in many estimation problems from
regular exponential families, this property of
Newton-Raphson allows it to create useful
iterates at an early stage.

Many texts on practical optimization
techniques (for example, Gill, Murray, and Wright,
1981; Dennis and Schnabel, 1983) suggest the use
of line searches to mitigate the effect of a
naive choice for the initial Hessian
approximation. When line searches are used, the second
step of the iteration (11) is modified to

$$A^{(n+1)} = A^{(n)} + \alpha^{(n)} s^{(n)} \qquad (11b')$$

where $\alpha^{(n)}$ is a scale value which is determined
by first calculating $s^{(n)}$ using (11a) and then
carrying out a unidimensional search along $s^{(n)}$
to find a value $\alpha^{(n)}$ which satisfies certain
conditions. In principle such searches may be useful.
However, most accepted line search procedures
involve function evaluations. Since in many
statistical applications (such as the one we
consider here) the function to be minimized is
the log of a product, hence the sum of many
logs, the function is extremely expensive to
evaluate. The additional expense of function
evaluations during line searches may outweigh
any efficiency advantages which might be gained
from using approximations to the Hessian
matrix.
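The cost concern can be made concrete with a small example. The following is a minimal sketch of a backtracking line search with a sufficient-decrease (Armijo) condition, one common choice among many and not necessarily one considered in this research; the function name `backtracking_step` and its parameters are assumptions. It returns the accepted step length together with the number of function evaluations it spent.

```python
def backtracking_step(f, a, s, grad_dot_s, alpha=1.0, shrink=0.5,
                      c1=1e-4, max_trials=30):
    """Backtracking search along s for a function f to be minimized.

    a is the current point, s the search direction, grad_dot_s the inner
    product of the gradient at a with s (negative for a descent direction).
    Returns (alpha, number_of_function_evaluations).
    """
    f0 = f(a)
    evals = 1
    for _ in range(max_trials):
        trial = [ai + alpha * si for ai, si in zip(a, s)]
        evals += 1
        if f(trial) <= f0 + c1 * alpha * grad_dot_s:
            return alpha, evals
        alpha *= shrink
    return alpha, evals
```

With $f(a) = a^2$, starting point 1 and an overlong direction $s = -4$, the search rejects step lengths 1 and 0.5 and accepts 0.25 after four function evaluations; when each evaluation is a sum of many logarithms, those extra evaluations are the expense discussed above.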

In this research we have used the following
procedure for initializing the Hessian
approximation:

(a) at the first EM iteration, we take
$H_1^{(0)} = H^{(0)}$, that is, we initialize the
approximate Hessian using the analytic Hessian
evaluated at $A = A_0$.

(b) for all subsequent EM iterations we take
$H_p^{(0)} = \gamma_p D_p$, where $\gamma_p$ is a scalar and
$D_p$ is a diagonal matrix.

The matrix $D_p$ is constructed following a
suggestion by Dennis and Schnabel (1983) that
prior knowledge about the magnitude of the
parameter estimates should be used to scale the
optimization problem. Let $(D_p)_i$ denote the ith
diagonal element of $D_p$. We set

$$(D_p)_i = \max\left[\,1,\ |(A_{p-1})_i|\,\right]$$

where $(A_{p-1})_i$ is the ith component of $A_{p-1}$. We
derive the constant $\gamma_p$ by considering the form
of the iteration (11). By taking first and
second derivatives of (8) it can be shown that
each of the components of the gradient vector is
a sum of K-1 terms, each of which involves a
sample size $n_{ij}$. Furthermore, each of the
elements of the diagonal of the analytic Hessian is
such a sum. Since we want both sides of (11a) to
be roughly on the same scale, we take

$$\gamma_p = \frac{\bar{n}(K-1)}{4}, \quad \text{where} \quad \bar{n} = \frac{2}{K(K-1)} \sum_{i=1}^{K-1} \sum_{j=i+1}^{K} n_{ij} \qquad (14)$$

is the average sample size. The expression (14)
would be equal to each of the diagonal elements
of the analytic Hessian evaluated at A=0, in the
case where $n_{ij} = n$ for all i and j.
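The scaled diagonal initialization can be sketched as follows. This is a minimal illustration in pure Python under two stated assumptions: that the diagonal entries are $\max(1, |(A_{p-1})_i|)$ and that the scalar is $\gamma_p = \bar{n}(K-1)/4$ with $\bar{n}$ the average of the $n_{ij}$ over pairs $i < j$. The function name `initial_hessian_diagonal` and the argument layout are hypothetical.

```python
def initial_hessian_diagonal(prev_estimate, sample_sizes):
    """Return the diagonal of gamma_p * D_p as a list.

    prev_estimate: the K-1 parameter estimates from the previous EM iteration.
    sample_sizes:  K x K nested list; only entries with i < j are read.
    """
    k = len(sample_sizes)
    pairs = [(i, j) for i in range(k) for j in range(i + 1, k)]
    n_bar = sum(sample_sizes[i][j] for i, j in pairs) / len(pairs)
    gamma = n_bar * (k - 1) / 4.0   # assumed form of the scalar gamma_p
    return [gamma * max(1.0, abs(a)) for a in prev_estimate]
```

For K = 3 items with every $n_{ij} = 50$ and previous estimates (0.5, -2.0), this gives $\gamma_p = 25$ and diagonal entries 25 and 50.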

This  procedure  for  initializing  the  Hessian 
approximation  is  a  compromise  between  using  the 
high-quality,  but  expensive,  scaling  information 
available  in  the  Newton-Raphson  procedure  and 
the  less  expensive,  and  less  reliable,  technique 
of  using  a  diagonal  matrix.  We  do  the  former 
when  our  prior  information  about  A  is  poor,  at 
the first iteration; we do the latter when our
prior information about A is better.

In fact, we use $A_1^{(0)} = 0$ as a starting value
only at p=1, the first iteration of the EM.
Since the EM process solves a sequence of
similar optimization problems, $A_{p-1}$ provides a very
good estimate of $A_p$. Therefore, for p=2, 3, ...
we take

$$A_p^{(0)} = \alpha A_{p-1},$$

where $\alpha \in (0,1]$ is a shrinking factor. The
shrinking factor is applied to prevent the
approximation to $A_p$ from being too good, in order to
allow sufficient quasi-Newton iterations to
build up a reasonable approximation to $H_p^{-1}$
before convergence occurs. A premature
convergence of the quasi-Newton iterations may cause
the EM iterations not to converge.

We have computed our BFGS updates by applying
two independent rank-one updates, according to
formula (13), to an $LDL^T$ decomposition of $H_p^{(n)}$
according to algorithm C1 of Gill, Golub,
Murray, and Saunders (1974). Other methods are
available to perform rank-one updates, and to
directly compute a rank-two update in a single
subroutine call. Some of these methods may
provide better numerical stability in hard
problems, but they are slower.


Some  numerical  results 


In this section we present results from a
comparison of quasi-Newton and Newton-Raphson
methods in carrying out empirical Bayes
estimation of the scale parameters of the
paired-comparison model. We use both real and simulated
data. Using the real data, we show that both
methods give virtually identical results.
Unfortunately, each of the real data sets is too
small to show any computational benefit from
using quasi-Newton methods (in fact, the
quasi-Newton iterations are much slower). We have
therefore simulated large data sets in order to
give an idea of the kind of savings in
computation time which might be expected from using
quasi-Newton techniques on large problems.

In the case of both the Newton-Raphson and
quasi-Newton methods we have used initial
estimates $A_1^{(0)} = 0$ and $A_p^{(0)} = \alpha A_{p-1}$ for
p=2,3,... The results reported below use
$\alpha = .8$. In testing we used $\alpha = .9$ as well, but
interestingly the smaller value of $\alpha$ induced
fewer quasi-Newton iterations. The effect of
using $\alpha > 0$ on Newton-Raphson was to reduce the
number of Newton iterations by about a third,
although there was no discernible difference in
effect between the two values of $\alpha$.

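The Newton-Raphson and quasi-Newton
iterations are terminated when

$$\frac{\|A_p^{(n)} - A_p^{(n-1)}\|_\infty}{\|A_p^{(n-1)}\|_\infty} < \epsilon$$

and the EM iterations are terminated when

$$\frac{|\hat\sigma_p^2 - \hat\sigma_{p-1}^2|}{\hat\sigma_{p-1}^2} < \epsilon$$

where $\epsilon > 0$ is an error tolerance. We have used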
e=10-^  . 
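A relative-change stopping rule of this kind can be sketched in a few lines. The following pure-Python sketch assumes the inner criterion compares the infinity norm of the change between successive iterates with the infinity norm of the previous iterate; the function names and the handling of a zero previous iterate (treated as "not yet converged") are assumptions, not part of the original procedure.

```python
def relative_change(new, old):
    """Infinity-norm change between iterates, relative to the previous iterate."""
    num = max(abs(a - b) for a, b in zip(new, old))
    den = max(abs(b) for b in old)
    if den == 0.0:
        return float("inf")  # assumption: never stop right after a zero start
    return num / den


def converged(new, old, eps):
    """True when the relative change falls below the tolerance eps."""
    return relative_change(new, old) < eps
```

The zero-denominator guard matters here because the first EM iteration starts the parameter vector at the origin.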

The simulated data sets were generated by
drawing M=K-1 values $A_i^*$ from a Normal (0,1)
distribution, for M=60, 80, 100, 150, and 200.
These simulated parameters were then used in
binomial experiments to generate data matrices
$\{x_{ij}^*\}$ according to the Bradley-Terry model (4).
We have used $n_{ij} = 50$ for all i and j in all
simulations.
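A data set of this kind can be generated with the Python standard library. The sketch below is an illustration only: it assumes the usual logistic form of the Bradley-Terry model, in which item i beats item j with probability $e^{A_i}/(e^{A_i}+e^{A_j})$, and it assumes the Kth parameter is fixed at zero so that M = K-1 values are free. The function name `simulate_bradley_terry` and the seed argument are hypothetical.

```python
import math
import random


def simulate_bradley_terry(k, n, seed=0):
    """Simulate paired-comparison counts for k items with n trials per pair.

    Returns (params, x) where params has k entries (the last fixed at 0.0)
    and x[i][j] is the number of times item i was preferred to item j.
    """
    rng = random.Random(seed)
    params = [rng.gauss(0.0, 1.0) for _ in range(k - 1)] + [0.0]
    x = [[0] * k for _ in range(k)]
    for i in range(k):
        for j in range(i + 1, k):
            p_ij = 1.0 / (1.0 + math.exp(params[j] - params[i]))
            wins = sum(1 for _ in range(n) if rng.random() < p_ij)
            x[i][j] = wins
            x[j][i] = n - wins
    return params, x
```

Calling `simulate_bradley_terry(61, 50)` would correspond to the M = 60 case with 50 comparisons per pair.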

Our numerical results are summarized in Table
1, where we report, for both Newton-Raphson and
quasi-Newton methods, the number of EM
iterations required, the average number of Newton
iterations per EM iteration, the estimated $\sigma^2$,
and the approximate computation time.

For each of the simulated data sets we
compute $s^2$, the empirical variance of the
generated $A_i^*$. Each of the variance estimates
generated by Newton-Raphson and by quasi-Newton
approaches is very close to the corresponding
$s^2$. In all cases, in fact, the Newton-Raphson
and quasi-Newton methods give virtually
identical variance estimates.

We note that in the case of the simulated
data sets, the size of the problem has very
little effect on the number of Newton or EM
iterations required to converge. In the case of
the real and the simulated data, the
quasi-Newton method requires 3.5 to 4 times as many
iterations per EM iteration. Again, this ratio
seems to be independent of the size of the
problem.

The column labelled "QN advantage" gives the
ratio of the Newton-Raphson time to the
quasi-Newton time. For problems in which M, the number
of parameters, is small, Newton-Raphson is clearly
faster. As M gets large, however, the
quasi-Newton advantage seems to approach M itself.
Such behavior is to be expected, as each
Newton-Raphson iteration involves $O(M^3)$ arithmetic
operations, while each quasi-Newton iteration
involves $O(M^2)$ such operations.

Discussion 

In  this  research  we  have  applied  quasi-Newton 
methods in a parametric empirical Bayes setting
where  the  EM  process  is  used  to  estimate  a  vari¬ 
ance  hyperparameter.  We  have  shown  that  the 
variance  estimates  calculated  using  our  tech¬ 
niques  are  virtually  identical  to  those  calcu¬ 
lated  using  Newton-Raphson  methods.  In  addition, 
the  quasi-Newton  methods  use  substantially  less 
computation  time  in  large  problems. 

The  potential  for  application  of  quasi-Newton 
methods  is  great  in  statistics,  not  only  as  part 
of  the  EM  process  but  more  generally.  A  very 
fertile  area  for  further  research  is  in  the 
scaling  of  quasi-Newton  optimization  when 
applied  to  parameter  estimation  problems.  We 
suspect  that  the  ad  hoc  scaling  solution  used  in 
this  research  will  generalize  to  a  rule  which 

may  be  applied  in  a  wide  range  of  estimation 
problems.
Table 1

Numerical Comparison of Quasi-Newton and Newton-Raphson Methods on Selected Data Sets

                                     Newton-Raphson                               Quasi-Newton
                No.             No. EM  Avg. Newton  Est.                   No. EM  Avg. QN  Est.                   QN
Data set        param.  s^2     iter.   iter.        sigma^2  time (3)      iter.   iter.    sigma^2  time (3)      advantage

Dominance (1)     6     N/A      36       4          1.4904   .1373 x 10^?   55      13      1.4321   .1382 x 10^?     0.01
Schoolboys (2)   12     N/A      10       4          1.6781   .5287 x 10^?   10      10      1.6781   .3230 x 10^?     0.16
(Simulated)      60     0.78      5       4          0.8540   .5494 x 10^?    5      14      0.8337   .2432 x 10^?     2.5
(Simulated)      80     0.84      5       4          0.8837   .1795 x 10^?    5      14      0.8835   .3019 x 10^?     5.9
(Simulated)     100     1.20      4       4          1.2255   .5897 x 10^?    4      15      1.2252   .3221 x 10^?    12.1
(Simulated)     150     0.?0      5       4          0.9908   .2000 x 10^?    5      15      0.9907   .1602 x 10^?   125.0
(Simulated)     200     0.90      4       4          1.0280   .7676 x 10^3    4      16      1.0279   .5517 x 10^?   144.4

Notes:


From  Appelby  (inSJ). 

2 From Kendall (1962).

3 Times are given in CPU sec. All computations were carried out on a SUN 3/50 workstation with floating point acceleration.
NUMERICAL ALGORITHMS FOR EXACT CALCULATIONS OF EARLY STOPPING PROBABILITIES
IN  ONE-SAMPLE  CLINICAL  TRIALS  WITH  CENSORED  EXPONENTIAL  RESPONSES 

Brenda MacGibbon, Concordia & UQAM, Susan Groshen, USC, Jean-Guy Levreault, U. de Mtl


For  some  cancers,  the  existing  treatment  regimens 
produce  long-term  disease-free  survival  rates  of  90%  or 
better.  In  this  situation  a  new  protocol  may  aim  to  re¬ 
duce  the  amount  or  duration  of  treatment,  while  main¬ 
taining  the  high  disease-free  survival  rates.  Although 
the primary goal is to evaluate the specific morbidity
of  such  a  new  protocol,  it  is  desirable  to  develop  rules 
to  stop  the  trial  if  many  patients  die  or  relapse  early 
in  the  study  and  to  study  the  statistical  properties  of 
these rules numerically. Since the failure (death or
relapse) or success (survival) of the nth patient is not
usually observed before the (n+1)st patient is entered onto
the  protocol,  most  developed  sequential  techniques  do 
not  apply  to  the  problem.  Most  group  sequential  tech¬ 
niques  involve  large  sample  results,  inappropriate  for 
small  studies.  If  the  survival  times  of  the  patients  fol¬ 
low  an  exponential  distribution  and  the  entry  times  into 
the  trial  are  Poisson,  and  if  these  are  independent,  then 
a  pure  birth-and-death  process  with  a  well-defined  tran¬ 
sition  matrix  is  an  appropriate  model.  Analysis  of  the 
process  enables  the  expression  of  error  rates  in  terms 
of  the  transition  probability  matrix  and  renders  these 
calculations  computationally  feasible.  A  conceptually 
simple  design  for  monitoring  a  trial,  in  which  a  new 
treatment is evaluated after each observed failure, is
presented  and  algorithms  to  calculate  the  error  rates 
of  interest  are  given.  Algorithms  for  the  calculation  of 
the  average  sample  number  (ASN),  the  median  and  the 
quartiles  of  the  sample  size,  as  a  function  of  the  ratio 
of  the  entry  rate  to  the  failure  rate,  are  constructed. 
Approximations  to  these  exact  results  are  also  given  by 
the  use  of  the  ballot  problem.  Finally,  the  methods  are 
illustrated  on  an  example  involving  the  design  of  a  pilot 
study. 

1  INTRODUCTION 

In the above setting, guidelines or criteria would be
useful in helping the investigator to decide when there
have been “too many” deaths or failures to justify the
continuation of the trial. Two problems may arise in
establishing criteria in these situations. The first occurs
when the response variable of interest is a time variable
such as survival, remission duration, or disease-free
survival: the failures (death or relapse) or successes
(survival or continuing remission) of the first n patients are
not usually observed prior to the entry of the (n+1)st
patient into the study. Thus the classical sequential
techniques, such as those described in Wald (1947),
cannot be applied to this situation. The second problem
occurs because the expected survival in the proposed
study is high or the total number of available patients is
limited. Either situation would imply that the observed
number of failures will probably be small. Sequential
and the more recently developed group sequential
techniques appropriate for censored data, or adaptable to
censored data, rely on large sample theory for
probability calculations (Pocock [1977], [1982], O’Brien and
Fleming [1979], Majumdar and Sen [1978], Gail [1982],
Jennison and Turnbull [1983]). It has been observed
(Gross and Clark [1975], Lesser and Cento [1981],
Benedetti et al. [1982]) that for analyzing censored
data, the effective sample size at a given time for a
group of n patients is approximately the number of
failures observed prior to that time, and furthermore
that asymptotic approximations depend on the number
of failures (Sellke and Siegmund [1983], Slud [1984],
Tsiatis [1982]). Thus asymptotic approximations may not
be appropriate under these circumstances.

Any  sequential  or  group  sequential  procedure  ap¬ 
propriate  for  the  one-arm  pilot  study  described  above 
will  therefore  require  exact,  finite-sample,  probabilities 
based on a nonparametric method or on a procedure
designed  for  a  specific  parametric  survival  distribution. 
Because  of  the  limited  number  of  failures  expected  in 
this  setting,  nonparametric  statistics  will  be  insensitive; 
parametric  techniques,  if  appropriate,  will  be  more  pow¬ 
erful.  Since  many  survival  patterns  can  be  well  sum¬ 
marized  with  an  exponential  curve,  several  sequential 
methods  have  been  proposed  for  the  exponential  distri¬ 
bution.  However  none  of  them  are  quite  suitable  for  the 
type  of  trial  under  consideration.  As  demonstrated  by 
Barndorff-Nielsen  and  Cox  [1984],  with  staggered  en¬ 
try,  the  distribution  of  the  one-sample  likelihood  ratio 
test  (and  therefore  the  maximum  likelihood  ratio  esti¬ 
mator)  for  the  parameter  of  the  exponential  curve  is 
not  explicitly  known  and  is  usually  approximated  using 
large  sample  results.  Epstein  and  Sobel  [1955]  consid¬ 
ered  one-sample  sequential  procedures  for  exponential 
failure.  Their  techniques  were  not  appropriate  for  cen¬ 
soring  due  to  staggered  entry  and  ultimately  involved 
large  sample  calculations.  Breslow  and  Haug  [1972]  de¬ 
veloped  two-sample  methods  for  comparing  exponential 
survival curves which used asymptotic approximations.
Canner  [1977]  employed  computer  simulations  to  de¬ 
velop  critical  regions  for  a  group  sequential  procedure  to 
compare  two  survival  curves.  Klein  and  Lerche  [1983] 
proposed  methods  which  could  lead  to  exact  calcula¬ 
tions  for  the  sequential  comparison  of  two  exponential 
survival  curves  but  used  large  sample  approximations 
to  obtain  results. 

When  the  survival  times  are  exponentially  distri¬ 
buted  and  entry  into  the  study  is  Poisson  and  inde¬ 
pendent  of  the  survival  times,  the  problem  can  be  mod¬ 
eled  as  a  pure  birth-and-death  process  (see  Ross  [1980]). 
This will accommodate censoring due to staggered entry
and does not rely on asymptotic theory, thus permitting
one to calculate the exact size and power of any preassigned
decision plan. Analysis of the process enables
the  expression  of  error  rates  in  terms  of  the  transition 
probability  matrix  and  renders  these  calculations  com¬ 
putationally  feasible.  A  conceptually  simple  design  for 
monitoring  a  trial,  similar  to  one  previously  proposed  by 
Breslow  [1970],  in  which  a  new  treatment  is  evaluated 
after  each  observed  failure,  is  presented  in  this  paper 
and  the  error  rates  of  interest  are  determined.  The  av¬ 
erage  sample  number  (ASN),  the  median  and  the  quar- 
tiles  of  the  sample  size,  are  calculated  as  a  function  of 
the  ratio  of  the  entry  rate  to  the  failure  rate. 

2  THE  TESTING  PROBLEM  AND  PROCEDURE 

Each patient who will be entered into the trial will
be represented by the pair of random variables, $(X, Y)$,
where X is the entry time of the patient measured from
time zero, the start of the trial, and where Y is the time
at which the patient fails, also measured from time zero.
We will consider the case where $Y - X$, the survival from
entry of the patient into the trial, is exponentially
distributed and where the entry into the trial is Poisson.
If patient entry is Poisson, then the waiting times are
exponential. That is, $X_{i+1} - X_i$ follows an exponential
distribution, where $X_i$ is the entry time of the ith
patient that comes into the study. Let $1/\lambda_f$ and $1/\lambda_e$
be the expected values of the exponential distributions
of the failure times, $(Y_i - X_i)$, and the waiting times
between entries, $(X_{i+1} - X_i)$, respectively. The random
variables $X_i$ and $Y_i - X_i$ are assumed to be
independent.

If in previous investigations with the intensive or
standard therapy, the mean survival time has been $\mu^*$,
then for ethical reasons, we require that $1/\lambda_f$ based on
the modified therapy under consideration be at least as
large as $\mu^*$. Thus the hypotheses under consideration
are:

$$H_0 : \lambda_f \le 1/\mu^*$$

$$H_a : \lambda_f > 1/\mu^* .$$

The proposed trial design (i.e. the stopping and
decision rule) for this testing problem can be
summarized as follows: if at any time, the “simple” failure
proportion, which is defined to be the observed ratio of
number of failures to total number of treated patients,
exceeds some predetermined threshold (which may
depend upon the number of failures) then the trial will
be stopped. This is the boundary proposed by Breslow
[1970] for binomial responses. More specifically,

1) Plan to enter a maximum of N patients.

2) Establish a threshold, $W^*$, such that if the “simple”
failure proportion exceeds $W^*$, the treatment will
be considered ethically unacceptable. This will lead
directly to a sequence of critical numbers,
$n_1^* < n_2^* < n_3^* < \cdots < n_I^* = N$, where $n_i^*$ is the
smallest integer greater than or equal to $i/W^*$. The
$W^*$ is chosen not only to control error rates, but
is also based on ethical considerations which reflect
unacceptably high values of $\lambda_f$.

3) At the time of each failure, record the number of
patients entered on to the trial. Let $n_i$ be the total
number of patients who have begun treatment at
the time of the ith failure.

4) If at the ith failure, $n_i < n_i^*$, stop the trial and
reject $H_0$. If $n_i \ge n_i^*$, continue accruing patients
until the next failure is observed or until N patients
have been treated.

5) When patient accrual has terminated according to
(4) above, then a complete analysis of the data
will be undertaken. The rules above are proposed
to monitor the study respecting ethical
considerations, and not to replace further appropriate
analyses.
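The monitoring rule in steps 1) to 4) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' program: it assumes the critical numbers are $n_i^* = \lceil i/W^* \rceil$, capped at N so that the last one equals N, and the function names `critical_numbers` and `stop_at_failure` are hypothetical. Exact rational arithmetic is used so that a threshold such as 1/5 is not distorted by floating-point rounding before the ceiling is taken.

```python
import math
from fractions import Fraction


def critical_numbers(w_star, n_max):
    """Critical numbers n_1* < n_2* < ... with the last one equal to n_max.

    w_star should be a Fraction (or a string such as "1/5") with 0 < w_star <= 1.
    """
    w = Fraction(w_star)
    crit = []
    i = 1
    while True:
        n_i = math.ceil(Fraction(i) / w)
        if n_i >= n_max:
            crit.append(n_max)
            return crit
        crit.append(n_i)
        i += 1


def stop_at_failure(i, n_entered, crit):
    """True if the trial stops at the ith failure with n_entered patients on study."""
    return n_entered < crit[i - 1]
```

With $W^* = 1/5$ and N = 20 the critical numbers are 5, 10, 15, 20: a second failure when only 8 patients have entered (proportion 0.25) stops the trial, while a second failure with 10 patients entered (proportion 0.20) does not.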

3  THE  BIRTH-AND-DEATH  PROCESS  MODEL 

3.1  Notation  and  Definitions 

Now let us define an event to be either the entry of
a patient onto the trial or the failure of a patient. Let
the pair $(r, j_r)$ denote the state with exactly $j_r$ failures
by the rth event and prior to the (r+1)st event. A
permissible path will be defined to be a sequence of the
pairs $(r, j_r)$, $\{(0,0), (1,j_1), (2,j_2), \ldots, (k,j_k)\}$,
satisfying $j_1 \le j_2 \le \cdots \le j_k$ and $r \ge 2j_r$ for all
$r = 1, \ldots, k$. Let $\Phi$ denote the set of permissible paths.

Now define $S_i$ (for $i = 1, 2, \ldots$) to be the subset
of all permissible paths that represent trials continued
through the (i-1)st failure and stopped at the ith
failure. Thus $S_i = [\,p \in \Phi : p = \{(0,0), (1,j_1), (2,j_2),
\ldots, (r,j_r), \ldots, (m,j_m)\}$ such that whenever $j_r = i$
it follows that $r \le n_i^* + i - 1$ and whenever $j_r < i$ it
follows that $r - j_r \ge n_{j_r}^*\,]$. A path in $S_i$ must have
strictly fewer than j failures at event time $j + n_j^* - 1$
(for $1 \le j < i$) and therefore must have at least $n_j^*$
entries into the trial at that time. At event time
$i + n_i^* - 1$, the path will have at least i failures and
no more than $n_i^* - 1$ entries into the trial.

The probability of stopping the trial at the ith
failure is the sum of the probabilities of all the paths in
$S_i$: $\Pr\{S_i\} = \sum_{p \in S_i} \Pr\{p\}$. Thus the probability of
continuing the trial to the end is $1 - \Pr\{\cup_{i=1}^{I} S_i\}$.
Let us denote this probability by P{C}. To calculate P{C},
we will model this problem as a birth-and-death process.

3.2  The  Birth-and-Death  Process  Model 

If the assumptions of exponentiality and independence in the preceding
section hold, then we have a birth-and-death process where the states are
determined by the number of patients alive and on trial (see Ross [1980],
Chapter 6, Section 3) and whose transition probability matrix,
$P = \{P_{i,j}\}$, is given below, with $\lambda_e$ the entry rate and
$\lambda_f$ the failure rate:

(1) $P_{0,1} = 1$

(2) for $1 \le i, j \le N + 1$

$P_{i,i+1} = \lambda_e / (\lambda_e + i\lambda_f)$

$P_{i,i-1} = i\lambda_f / (\lambda_e + i\lambda_f)$

$P_{i,j} = 0$ for $j \ne i - 1$ or $i + 1$

$P_{N+1,N+1} = 1$

$P_{i,i+1}$ is the probability that another patient enters the trial prior
to any failure when there are $i$ patients on trial and at risk for
failure. $P_{i,i-1}$ is the probability that a patient fails prior to
another entry, when there are $i$ patients at risk. $N$ is the total number
of patients that will enter the trial if early termination does not occur
and $I - 1$ is the total number of failures permitted prior to the entry of
the $N$th patient.

$P$ is the transition matrix of a Markov chain whose states denote the
total number of patients alive and in the trial, and $P_{i,j}$ represents
the probability of moving from state $i$ to state $j$ after the occurrence
of one event (entry or failure). Let $P^{(r)}_{i,j}$ be the $(i,j)$th entry
in the product matrix $P^r$. $P^{(r)}_{i,j}$ represents the probability of
moving from state $i$ to state $j$ after the occurrence of $r$ events. Thus
$\Pr\{(r,j)\} = P^{(r)}_{0,\,r-2j}$.
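The matrix and its powers can be illustrated with a short sketch. It assumes the transition probabilities take the usual birth-and-death form (entry rate `lam_e`, per-patient failure rate `lam_f`); the function names are hypothetical and are not taken from the paper.

```python
def transition_matrix(lam_e, lam_f, n_max):
    """Build P for states 0..n_max+1 (number of patients alive and on trial)."""
    size = n_max + 2
    P = [[0.0] * size for _ in range(size)]
    P[0][1] = 1.0  # nobody at risk: the next event must be an entry
    for i in range(1, size - 1):
        total = lam_e + i * lam_f
        P[i][i + 1] = lam_e / total      # an entry occurs before any failure
        P[i][i - 1] = i * lam_f / total  # one of the i patients fails first
    P[size - 1][size - 1] = 1.0  # top state is absorbing
    return P


def step(row, P):
    """Advance a row vector of state probabilities by one event."""
    size = len(P)
    return [sum(row[i] * P[i][j] for i in range(size)) for j in range(size)]


def r_step_from_zero(P, r):
    """Row 0 of P^r: state distribution after r events, starting with no patients."""
    row = [0.0] * len(P)
    row[0] = 1.0
    for _ in range(r):
        row = step(row, P)
    return row


def prob_r_events_j_failures(P, r, j):
    """Pr{(r, j)}: entry (0, r - 2j) of P^r, or 0 when r - 2j is negative."""
    state = r - 2 * j
    if state < 0:
        return 0.0
    return r_step_from_zero(P, r)[state]
```

For example, with `lam_e = 12` and `lam_f = 1`, two events starting from an empty trial leave two patients at risk with probability 12/13 and none with probability 1/13.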

To calculate the probability of terminating the trial prior to the entry
of the initially specified $N$ patients, for given $\lambda_e$ and
$\lambda_f$, we will use the transition matrices, $P^r$, to calculate the
exact probabilities of the sets $S_i$. These transition matrices also
enable us to easily compute the average sample number (ASN), the usual
measure of effectiveness of stopping rules. At the same time, it was felt
that the median sample size and the quartile sample sizes could be viewed
as a more robust measure of effectiveness of the early stopping mechanism
and the transition matrices have also been used to calculate these
quantities.

More explicitly, in order to facilitate the discussion of the probability
calculations, we will limit ourselves to the following hypothetical trial
with the following precise stopping rules (the method can be easily
modified for other values of the $n_i^*$'s):

1) Do not plan to stop directly after the 1st or 2nd failure
($n_1^* = n_2^* = 1$)

2) If the 3rd failure occurs before the 20th entry, stop; if not, continue
($n_3^* = 20$)

3) If the 4th failure occurs before the 27th entry, stop; if not, continue
($n_4^* = 27$)

4) If the 5th failure occurs before the 33rd entry, stop; if not, continue
($n_5^* = 33$)

5) If the 6th failure occurs before the 40th entry, stop; if not, continue
($n_6^* = 40$)

6) If the 7th failure occurs before the 60th entry, stop; if not, continue
($n_7^* = 60$)

7) Stop at the 60th entry.
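The rules above can be read as a small decision procedure. The following sketch applies them to a sequence of events coded as 'E' (entry) and 'F' (failure); the encoding and the function name `run_trial` are illustrative assumptions, not part of the original design.

```python
# Critical entry counts n_i* for the i-th failure (rules 1 to 6 above).
N_STAR = {1: 1, 2: 1, 3: 20, 4: 27, 5: 33, 6: 40, 7: 60}
MAX_ENTRIES = 60  # rule 7


def run_trial(events, n_star=N_STAR, max_entries=MAX_ENTRIES):
    """Apply the stopping rules to a string or list of 'E'/'F' events.

    Returns (status, entries, failures), where status is one of
    'stopped early', 'completed' or 'ongoing'.
    """
    entries = failures = 0
    for ev in events:
        if ev == "E":
            entries += 1
            if entries == max_entries:
                return ("completed", entries, failures)
        elif ev == "F":
            if failures >= entries:
                raise ValueError("a failure cannot precede the patient's entry")
            failures += 1
            # Stop if the i-th failure arrives before the n_i*-th entry.
            if entries < n_star[failures]:
                return ("stopped early", entries, failures)
        else:
            raise ValueError("events must be 'E' or 'F'")
    return ("ongoing", entries, failures)
```

A third failure after only 19 entries stops the trial, while the same three failures after 20 entries do not.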

The events $S_1, S_2, S_3, \ldots, S_7$ and $C$ will be defined as before
in Section 3.1. Let $T(i,j)$ denote the outcome of being in the $j$th state
at the $i$th event time for $i \ge j$.

Figure 1 can be used to visualize permissible paths for the given trial
design. Intuitively a permissible path will be a path with a non-positive
slope that passes through balls at each of the six stages. An $S_3$ path
will pass from $T(0,0)$ to a white ball at event time 22. An $S_6$ path
will be a permissible path passing through black balls at event times 0,
22, 30 and 37, and a white ball at event time 45.


FIGURE 1: TWO PERMISSIBLE PATHS FOR TRIAL

The mathematical formulation will now be developed. $T(i,j)$, the outcome
of being in the $j$th state at the $i$th event time ($i \ge j$), actually
represents a trial in which there have been exactly $j + (i-j)/2$ entries
and $(i-j)/2$ failures. (Note that $\Pr\{T(i,j)\} = 0$ if $i - j$ is not
even.) Let $\pi[T(i,j) \to T(k,m)]$ represent the set of permissible paths
from $T(i,j)$ ($i \ge j$) to $T(k,m)$ ($k \ge m$). The fact that $\pi$
contains only permissible paths implies that $(i-j)/2 \le (k-m)/2$.
Clearly, if all trials were to be run from the 0th until the 22nd event
time, then the $\pi[T(0,0) \to T(22,2n)]$ for $n = 0, 1, \ldots, 8$ would
represent permissible paths which would be stopped by the stopping rule
for the 3rd failure. Since
$\sum_{n=0}^{11} \Pr\{\pi[T(0,0) \to T(22,2n)]\}$ is equal to 1, and since
$\pi[T(0,0) \to T(22,22)]$, $\pi[T(0,0) \to T(22,20)]$ and
$\pi[T(0,0) \to T(22,18)]$ represent the sets of all permissible paths of
trials not stopped by the stopping rule for the 3rd failure, the
probabilities of stopping or continuing at this stage can be easily
calculated using the stochastic matrix, $P$, defined in Section 3.2.

Let $e^i$ represent the (row) vector $(0, \ldots, 0, 1, 0, \ldots)$, with a
1 in the $(i+1)$st place and 0's elsewhere. Let $(v)_j$ represent the
$j$th element of the vector contained within the parentheses, $v$. Now,
$\Pr\{\pi[T(0,0) \to T(22,2n)]\}$ can be written as $(e^0 P^{22})_{2n}$,
the $(2n)$th element of the vector $e^0 P^{22}$. More generally, if
$\pi[T(i,j) \to T(k,m)]$ represents the set of permissible paths (that is,
$i \ge j$, $k \ge m$, $(i-j)/2 \le (k-m)/2$, and $(i-j)$ and $(k-m)$ are
even) then we have

$$\Pr\{\pi[T(i,j) \to T(k,m)]\} = (e^j P^{k-i})_m \qquad (A1)$$

If $\pi[T(i,j) \to T(k,m) \to T(n,q)]$ is permissible, then its
probability is given by

$$(e^j P^{k-i})_m \times (e^m P^{n-k})_q \qquad (A2)$$

The dotted line on Figure 1 joining (0,0) to (22,20) to (30,26) to (37,33)
to (45,39) to (66,52) represents a subset of permissible paths with an
endpoint of 59 entries and 7 failures (that is, a trial stopped only at the
7th failure). Its probability is calculated as
$(e^{0}P^{22})_{20} \times (e^{20}P^{8})_{26} \times (e^{26}P^{7})_{33} \times (e^{33}P^{8})_{39} \times (e^{39}P^{21})_{52}$.

Thus the probability of the events $S_1, S_2, \ldots, S_7$, and $C$ can be
calculated exactly, using the stochastic matrix. It suffices to enumerate
the permissible paths for each event and to calculate their exact
probabilities using extensions of equations (A1) and (A2).

For determining the ASN, the median, and the quartiles of the sample size,
the above calculations can be modified to compute, for each $n$, the
probability of stopping the trial after the entry of the $n$th patient and
prior to the entry of the $(n+1)$st patient. To see this, let $i$ be the
number of failures such that $n_{i-1}^* \le n < n_i^*$. The trial will stop
after the $n$th entry only if the $i$th failure occurs between the $n$th
and $(n+1)$st entries. Therefore the probability of stopping at $n$
patients is the sum of the probabilities of the paths in $S_i$ with exactly
$n + i$ events.
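As a cross-check on such calculations, the distribution of the stopping sample size can also be sketched with a forward recursion over (entries, failures) states. This is not the path enumeration described above; it is an alternative bookkeeping under the same assumptions (Poisson entry, exponential failure, the hypothetical trial of Section 3.2). It relies on the fact that each event probability depends only on the ratio rho = lam_e / lam_f. All names are illustrative.

```python
from math import exp

N_STAR = {1: 1, 2: 1, 3: 20, 4: 27, 5: 33, 6: 40, 7: 60}
MAX_ENTRIES = 60


def stopping_distribution(rho, n_star=N_STAR, max_entries=MAX_ENTRIES):
    """Return (stop, p_complete) for entry-to-failure rate ratio rho.

    stop[(n, i)] is the probability of stopping at the i-th failure with
    n patients entered; p_complete is the probability of reaching the
    final entry without stopping early.
    """
    live = {(0, 0): 1.0}  # (entries, failures) -> probability, trial running
    stop = {}
    p_complete = 0.0
    while live:
        nxt = {}
        for (e, f), p in live.items():
            at_risk = e - f
            p_entry = rho / (rho + at_risk)  # equals 1 when nobody is at risk
            # Next event is an entry.
            if e + 1 == max_entries:
                p_complete += p * p_entry
            else:
                nxt[(e + 1, f)] = nxt.get((e + 1, f), 0.0) + p * p_entry
            # Next event is a failure.
            if at_risk > 0:
                p_fail = p * (1.0 - p_entry)
                if e < n_star[f + 1]:  # failure f+1 came before entry n*_{f+1}
                    stop[(e, f + 1)] = stop.get((e, f + 1), 0.0) + p_fail
                else:
                    nxt[(e, f + 1)] = nxt.get((e, f + 1), 0.0) + p_fail
        live = nxt
    return stop, p_complete


def summary(log_ratio):
    """Return (P{stop early}, P{C}, ASN) for a given ln(lam_e / lam_f)."""
    stop, p_complete = stopping_distribution(exp(log_ratio))
    p_early = sum(stop.values())
    asn = sum(n * p for (n, _), p in stop.items()) + MAX_ENTRIES * p_complete
    return p_early, p_complete, asn
```

The median and quartiles follow from the same `stop` dictionary by accumulating probabilities in order of n.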

3.3  An approximation to the true probability of the
events $S_3, \ldots, S_7$, $C$ using the ballot problem.

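Recall that if we have two independent Poisson processes with parameters
$\lambda_1, \lambda_2$ then the probability that the $n$th waiting time in
the first process occurs before the $m$th waiting time in the second
process is:

$$\sum_{k=n}^{n+m-1} \binom{n+m-1}{k} \left(\frac{\lambda_1}{\lambda_1+\lambda_2}\right)^{k} \left(\frac{\lambda_2}{\lambda_1+\lambda_2}\right)^{n+m-1-k}$$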
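A minimal numerical sketch of this probability, assuming the standard binomial-sum form for two independent Poisson processes (the function name is hypothetical):

```python
from math import comb


def prob_nth_before_mth(n, m, lam1, lam2):
    """P(n-th event of process 1 occurs before m-th event of process 2)."""
    p = lam1 / (lam1 + lam2)  # chance that any given event is from process 1
    total = n + m - 1
    # At least n of the first n + m - 1 events must come from process 1.
    return sum(
        comb(total, k) * p ** k * (1.0 - p) ** (total - k)
        for k in range(n, total + 1)
    )
```

With n = m = 1 this reduces to lam1 / (lam1 + lam2), and swapping the roles of the two processes gives the complementary probability.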

Although our Poisson processes (the entry process and the failure process)
are far from being independent, we can use a similar approach if we make
the following assumption.

A sequence will be said to be of type $[i,j]$ if it consists of exactly
$i$ entries and $j$ failures. In order that a $[i,j]$ type sequence
represent a trial, we will require that for each subsequence $S_k$ of
length $k \le i + j$, the number of entries in $S_k$ be greater than the
number of failures in $S_k$. Such a sequence will be called admissible.

Now let us assume that the probability of any sequence of type $[i,j]$ is
equally likely (this probability will be denoted $A(i,j)$); standard
mathematical techniques can be used to approximate or bound this
probability. Under this assumption, the weak sense version of the ballot
problem (cf. Barton and Mallows (1965)) can be applied and the probability
that a sequence of type $[i,j]$ is admissible = $(i + 1 - j)/(i + 1)$. (Let
us denote this probability by $q(i,j)$.)
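The ballot-problem ratio can be checked by brute force for small cases. The sketch below counts orderings in which every initial segment has at least as many entries as failures, which is the reading (ties allowed) under which $(i+1-j)/(i+1)$ holds; that reading of "weak sense" and the function names are assumptions.

```python
from itertools import combinations
from math import comb


def count_admissible(i, j):
    """Count orderings of i entries and j failures in which no initial
    segment has more failures than entries."""
    n = i + j
    count = 0
    for fail_positions in combinations(range(n), j):
        fails = set(fail_positions)
        entries = failures = 0
        ok = True
        for pos in range(n):
            if pos in fails:
                failures += 1
            else:
                entries += 1
            if failures > entries:
                ok = False
                break
        if ok:
            count += 1
    return count


def q(i, j):
    """Ballot-problem proportion of admissible sequences of type [i, j]."""
    return (i + 1 - j) / (i + 1)
```

For instance, of the 10 orderings of 3 entries and 2 failures, 5 are admissible, matching q(3, 2) = 1/2.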

Then the probability of $n_3$ entries before $k_3$ failures, that is, the
probability of not stopping the trial at the first stage is:

$$\sum_{j=n_3}^{n_3+k_3-1} \Pr\{E(j,\, n_3 + k_3 - 1 - j)\}$$

where $E(i,j)$ = union of all type $[i,j]$ admissible paths and

$$\Pr\{E(i,j)\} = \frac{\text{Total \# of admissible sequences of type } (i,j) \times A(i,j)}{\sum_{k} \text{Total \# of admissible sequences of type } (i+j-k,\,k) \times A(i+j-k,\,k)}$$

$$= \frac{\binom{i+j}{j}\, q(i,j)\, A(i,j)}{\sum_{k=0}^{[(i+j)/2]} \binom{i+j}{k}\, q(i+j-k,\,k)\, A(i+j-k,\,k)}$$

where $[(i+j)/2]$ = largest integer less than or equal to $(i+j)/2$. Thus,
the probability of not stopping the trial at the first stage is

$$\sum_{j=n_3}^{n_3+k_3-1} r(j,\, n_3 + k_3 - 1 - j)$$

where

$$r(j,\, n_3+k_3-1-j) = \frac{\binom{n_3+k_3-1}{j}\, q(j,\, n_3+k_3-1-j)\, A(j,\, n_3+k_3-1-j)}{\sum_{m=0}^{[(n_3+k_3-1)/2]} \binom{n_3+k_3-1}{m}\, q(n_3+k_3-1-m,\, m)\, A(n_3+k_3-1-m,\, m)}$$

These  approximations  tend  to  work  reasonably  well  in 
practice. 

4. AN EXAMPLE OF THE STOPPING RULE FOR GOOD PROGNOSIS PATIENTS WITH
OSTEOGENIC SARCOMA

The  calculations  presented  in  this  manuscript  were 
prompted  by  the  following  clinical  situation. 

One method of treating the bone cancer, osteogenic sarcoma, in a subgroup
of children involves intense chemotherapy followed by surgery and then
more chemotherapy (see Rosen et al. [1982] and Rosen et al. [1983]).
Examination of the tumor after removal by surgery can identify those
patients with tumors which are very sensitive to the pre-operative
chemotherapy and have had at least a 90% tumor reduction. These patients
with responsive tumors appear to have a reasonably good prognosis with the
probability of disease-free survival estimated to be approximately .85
(± .077 = S.E.) at three years from the start of therapy (unpublished
update, Rosen (1982)). However, the chemotherapy regimen is quite intense,
with both short-term and possible long-term side-effects. A modified
treatment protocol was proposed to shorten the duration of the
post-operative chemotherapy in those patients who had experienced at least
90% tumor reduction as a result of the preoperative chemotherapy. The goal
was to reduce the severity of the side-effects while maintaining the
overall higher probability of disease-free survival.

It was estimated that approximately 12-15 patients a year would be
eligible for the study. Since a study lasting over 5 years was not
considered practical, it was decided to plan a single-arm study with 60
patients. Although the main objective of the study was to evaluate
toxicity and side effects, it was agreed that a mechanism to monitor the
number of disease recurrences as well as deaths was necessary. Monitoring
rules, such as the one in Section 2, were proposed. Based on past studies,
it was decided that the assumptions of Poisson arrival for treatment, and
of exponential failure during the first 3 years after the beginning of
therapy, were reasonable and would provide useful approximations to the
actual distributions.

The stopping rule proposed in the previous section has its critical values
chosen so that the "simple" failure proportions would never exceed 15%.
$n_1^*$ and $n_2^*$ were set to 1 so that the trial would not stop after
one or two early failures. The maximum number of patients was set to 60
and the maximum number of failures permitted was determined by the
critical region for a .05 level one-sided test of a binomial parameter,
$\pi = .05$ with $n = 60$.

Table 1 presents the probabilities of stopping early for a variety of
ratios of entry rate ($\lambda_e$) to failure rate ($\lambda_f$). For the
particular situation under consideration in this example, we would want a
high probability of early stopping if the three-year survival were much
below 0.80 and we would certainly want a small probability of early
stopping if the three-year survival were 0.90 or better.

TABLE 1: PROBABILITY OF STOPPING EARLY*

ln(λ_e/λ_f)   (λ_e = 10)   (λ_e = 12)   (λ_e = 15)     **

   3.20         0.294        0.231        0.160       1.000
   3.30         0.331        0.265        0.190       1.000
   3.40         0.367        0.301        0.223       1.000
   3.50         0.404        0.337        0.257       1.000
   3.60         0.441        0.374        0.292       1.000
   3.70         0.476        0.411        0.329       1.000
   3.80         0.511        0.447        0.365       1.000
   3.90         0.545        0.483        0.402       1.000
   4.00         0.577        0.517        0.439       1.000
   4.10         0.608        0.551        0.474       1.000
   4.20         0.638        0.583        0.509       1.000
   4.30         0.666        0.614        0.543       1.000
   4.40         0.692        0.643        0.576       0.999
   4.50         0.717        0.670        0.607       0.997
   4.60         0.740        0.696        0.636       0.993
   4.70         0.761        0.721        0.664       0.984
   4.80         0.781        0.744        0.691       0.969
   4.90         0.800        0.765        0.715       0.944
   5.00         0.817        0.785        0.738       0.905
   5.25         0.854        0.828        0.790       0.744
   5.50         0.885        0.863        0.832       0.515
   5.75         0.909        0.892        0.867       0.294
   6.00         0.928        0.915        0.894       0.139
   6.25         0.944        0.933        0.917       0.056
   6.50         0.956        0.947        0.935       0.020
   6.75         0.965        0.959        0.949       0.007
   7.00         0.973        0.968        0.960       0.002

*  Columns 2, 3, 4 correspond to three-year survival corresponding to
   entry rates of 10, 12 and 15 patients/year respectively.
** Column 5 corresponds to the probability of stopping early.
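The three-year survival figures associated with Table 1 follow directly from the exponential failure assumption: with failure rate $\lambda_f$ per year, survival at three years is $\exp(-3\lambda_f)$, and $\lambda_f = \lambda_e / \exp(\ln(\lambda_e/\lambda_f))$. A small sketch of that conversion (the function name is an illustrative choice):

```python
from math import exp


def three_year_survival(log_ratio, lam_e, years=3.0):
    """Survival at `years` for entry rate lam_e (patients/year) and a given
    value of ln(lam_e / lam_f), assuming exponential failure times."""
    lam_f = lam_e / exp(log_ratio)  # failure rate per patient-year
    return exp(-years * lam_f)
```

For ln(λ_e/λ_f) = 3.20 and an entry rate of 10 patients per year this gives approximately 0.294, the first survival entry in the table.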

Figure 2 plots the median, mean, upper and lower quartiles for the number
of patients entered for SR3. Once more, the mean and median are similar
for the values of $\ln(\lambda_e/\lambda_f)$ under consideration. The mean
tends to be larger than the median when the probability of early stopping
is higher and therefore the distribution is somewhat skewed to the left;
the mean tends to be smaller than the median when the probability of early
stopping is lower and therefore the distribution is somewhat skewed to the
right.

FIGURE 2: Number of patients to be entered into study with stopping rule
1: Expected number, Median, Lower and Upper quartile.

5.  CONCLUSION

With the calculations of the entries in the matrices $P^n$ ($n \ge 1$) for
different values of $\lambda_e$ and $\lambda_f$, the exact probability of
not stopping early can be computed for a given trial design by summing
over the appropriate products as presented in the appendix. With these
calculations, the proposed trial design can be evaluated in terms of size,
power, ASN, and median sample size. In practice, the patient referral
patterns are often known from past experience and thus $\lambda_e$ may be
estimated; different values of $\lambda_f$ can be used to evaluate the
design. This manuscript confined itself to the study of one stopping rule;
in MacGibbon et al. [1988] several stopping rules are compared and
examined according to the above criteria.

Canner [1977] also considered the problem of monitoring a trial when the
failure was exponential. Using computer simulations for much larger
studies, he found that his results were reasonably robust against changing
referral patterns, but quite a bit more sensitive to departures from the
assumption of exponentially distributed failure data. The effect of
varying referral patterns in the setting under consideration is currently
being studied.

Since the calculations of the size and power are exact if the failure
distribution is exponential and if patient entry is Poisson, then for
small and moderate sized studies, the proposed sequential stopping rules
can be used as exact procedures, thus establishing the objective criteria
to permit the necessary monitoring of one-sample studies.

ABSTRACT

Calculating maximum-likelihood estimates for a mixture of normal
distributions can be one of the most computationally intensive problems in
parametric estimation. Maximizing the corresponding likelihood function is
complicated by singularities and numerous spurious maximizers. Currently
the most popular technique for finding maximizers of the likelihood
function is the EM (Expectation Maximization) algorithm. While this
iterative algorithm is extremely reliable and usually finds the "good"
maximizer from most reasonable initial guesses, it is very slow in cases
where the overlap between component normal distributions is great.
Another approach, which is faster though thought to be less reliable, is
to directly maximize the likelihood function using a (locally) fast
iterative algorithm based on some variant of Newton's method. The
disadvantage with these quasi-Newton methods is that sometimes the
estimate obtained is very dependent on the initial guess used. This paper
presents some preliminary numerical results indicating the relative
strengths and weaknesses of the EM and quasi-Newton approaches found by
testing several methods on a variety of mixture estimation problems.
Comparisons made include the computational efficiency and reliability of
the approaches tested. The ultimate goal of this research is to learn how
the two basic approaches can be hybridized in order to achieve a method
that is both quickly convergent and reliable.

1.  INTRODUCTION

Normal mixtures are a widely applicable modeling tool whenever the
statistical population of interest is itself composed of subpopulations
which are distributed according to different normal distributions. A
t-variate normal density $p(x|\mu,\Sigma)$, with t-variate mean vector
$\mu$ and $t \times t$ symmetric, positive definite covariance matrix
$\Sigma$, is defined for t-variate real $x$ by

$$p(x|\mu,\Sigma) = \frac{\exp\left(-(x-\mu)^T \Sigma^{-1} (x-\mu)/2\right)}{(2\pi)^{t/2}\,|\Sigma|^{1/2}}$$

A mixture of $m$ t-variate normal distributions $p_m(x|\Theta)$ is defined
by

$$p_m(x|\Theta) = \alpha_1 p(x|\mu_1,\Sigma_1) + \cdots + \alpha_m p(x|\mu_m,\Sigma_m) \qquad (1)$$

where $\Theta$ collectively refers to all the individual component mean
and covariance parameters along with the mixing proportions
$\alpha_1, \alpha_2, \ldots, \alpha_m$, which must sum to 1 and have values
between 0 and 1.

Some of the earlier applications of densities of the form in (1) are from
the field of fisheries research, from which we borrow an illustrative
example. According to Hosmer (1973), adult halibut of a given age class
have lengths distributed according to a mixture of two univariate normal
distributions. The lengths of the male halibut are actually normally
distributed, as are the lengths of the females, but the two normal
distributions modeling the male and female subpopulations are not the
same. In this case the overall population of all halibut of a given age
class is not normally distributed, but rather is distributed according to
a mixture of 2 normals

$$p_2(x|\Theta) = \alpha_M\, p(x|\mu_M,\sigma_M^2) + \alpha_F\, p(x|\mu_F,\sigma_F^2) \qquad (2)$$

where for convenience we use M and F to denote parameters corresponding to
the male and female subpopulations, respectively. Note that in the
notation used above $\mu_F$, for example, is the mean length of all female
halibut of a given age class, while $\alpha_M$, for example, can be
interpreted as the proportion of all halibut of the given age class that
are male. Accurate estimates of the complete mixture parameter $\Theta$,
based on data $\{x_1, \ldots, x_n\}$ consisting of the lengths of $n$
halibut taken from the population, would be important to scientists
interested in better understanding the population dynamics for halibut. In
this paper, we are not interested in the applications, but rather in the
computational problem of accurately and efficiently estimating the
parameter $\Theta$ for mixture densities of the form (1), given a
t-variate sample $\{x_1, \ldots, x_n\}$ distributed according to the
unknown distribution. An excellent reference for more information on both
applications and estimation techniques is Redner and Walker (1984).

The next section contains a brief description of the two types of methods
(EM and quasi-Newton) to be tested, and Section 3 contains a description
of the numerical tests performed, along with comparisons of the results
obtained by the two approaches. The last section contains a discussion of
the results along with ideas concerning the successful hybridization of
the methods tested.

2.  The EM and Quasi-Newton Methods

The theory of maximum-likelihood states that good estimates of the unknown
parameter can be obtained from the log-likelihood function $L(\Theta)$,
which is defined for a given t-variate sample $\{x_1, \ldots, x_n\}$ and
density of the form in (1), by

$$L(\Theta) = \log(p_m(x_1|\Theta)) + \log(p_m(x_2|\Theta)) + \cdots + \log(p_m(x_n|\Theta))$$

The maximum-likelihood theory applied to the normal mixture case asserts
that a particular (local) maximizer of $L(\Theta)$ will be a good estimate
for the unknown parameter $\Theta_0$. This "good" maximizer is denoted here
by $\Theta^*$, and the two approaches considered here result from applying
different optimization techniques to the problem of finding $\Theta^*$ by
maximizing $L(\Theta)$.
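For the univariate case used later in the simulations, the log-likelihood can be written out in a few lines. This is a minimal sketch; the function names and the representation of the parameter as three parallel lists are illustrative assumptions.

```python
from math import exp, log, pi, sqrt


def normal_pdf(x, mu, var):
    """Univariate normal density with mean mu and variance var."""
    return exp(-((x - mu) ** 2) / (2.0 * var)) / sqrt(2.0 * pi * var)


def mixture_pdf(x, alphas, mus, variances):
    """Mixture density: sum of alpha_i times the i-th normal density."""
    return sum(a * normal_pdf(x, m, v) for a, m, v in zip(alphas, mus, variances))


def log_likelihood(data, alphas, mus, variances):
    """L(Theta): sum over the sample of the log mixture density."""
    return sum(log(mixture_pdf(x, alphas, mus, variances)) for x in data)
```

Splitting one normal component into two identical components with weights that sum to 1 leaves the value of L unchanged, which is a quick sanity check on the definition.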

Both of the optimization algorithms are iterative in nature, and generate,
in theory, an infinite sequence of approximations
$\Theta_{(1)}, \Theta_{(2)}, \ldots$ to $\Theta^*$. The procedure starts
with the user supplying an initial, and usually rough, estimate
$\Theta_{(1)}$ of $\Theta^*$. Then the terms of the sequence
$\Theta_{(2)}, \Theta_{(3)}, \ldots$ are successively calculated until a
particular $\Theta_{(s)}$ is obtained that is close enough to $\Theta^*$
to warrant termination of the iteration. Sometimes the iterates never get
close to $\Theta^*$, but as a practical matter some stopping criterion
must be chosen for each method. The main check used in these tests
compares (in a way described in Section 3) the new iterate
$\Theta_{(r+1)}$ with the old iterate $\Theta_{(r)}$ at each step. If
there is very little difference between the iterates, then this usually
means that the accuracy of $\Theta_{(r+1)}$ as an approximation to
$\Theta^*$ will not be much improved by further iteration. For this reason
the implemented iteration is terminated as soon as two successive iterates
are very similar.

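The differences in the EM and quasi-Newton schemes concern the way that
the new iterate $\Theta_{(r+1)}$ is calculated. For the EM algorithm, this
is easily described. For $i = 1, \ldots, m$, where $m$ is the number of
normal components in the mixture, the following calculations are done for
$\alpha_{i(r+1)}$, $\mu_{i(r+1)}$, and $\Sigma_{i(r+1)}$:

$$\alpha_{i(r+1)} = \frac{1}{n} \sum_{k=1}^{n} w_{ik}, \qquad w_{ik} = \frac{\alpha_{i(r)}\, p(x_k|\mu_{i(r)},\Sigma_{i(r)})}{p_m(x_k|\Theta_{(r)})}$$

$$\mu_{i(r+1)} = \left(\sum_{k=1}^{n} x_k\, w_{ik}\right) \Big/ \left(\sum_{k=1}^{n} w_{ik}\right)$$

$$\Sigma_{i(r+1)} = \left(\sum_{k=1}^{n} (x_k - \mu_{i(r+1)})(x_k - \mu_{i(r+1)})^T\, w_{ik}\right) \Big/ \left(\sum_{k=1}^{n} w_{ik}\right)$$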
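A single EM iteration for a univariate normal mixture can be sketched as follows. It uses the standard form of the EM update for normal mixtures (posterior weights, then weighted proportions, means and variances); the function names and data layout are illustrative assumptions rather than the code used in the tests.

```python
from math import exp, log, pi, sqrt


def normal_pdf(x, mu, var):
    """Univariate normal density with mean mu and variance var."""
    return exp(-((x - mu) ** 2) / (2.0 * var)) / sqrt(2.0 * pi * var)


def log_likelihood(data, alphas, mus, variances):
    """Log-likelihood of a univariate normal mixture."""
    return sum(
        log(sum(a * normal_pdf(x, m, v) for a, m, v in zip(alphas, mus, variances)))
        for x in data
    )


def em_step(data, alphas, mus, variances):
    """One EM iteration; returns the updated (alphas, mus, variances)."""
    n = len(data)
    # Posterior weight of component i for observation x_k.
    weights = []
    for x in data:
        terms = [a * normal_pdf(x, m, v) for a, m, v in zip(alphas, mus, variances)]
        total = sum(terms)
        weights.append([t / total for t in terms])
    new_alphas, new_mus, new_vars = [], [], []
    for i in range(len(alphas)):
        w_sum = sum(w[i] for w in weights)
        mu_i = sum(w[i] * x for w, x in zip(weights, data)) / w_sum
        var_i = sum(w[i] * (x - mu_i) ** 2 for w, x in zip(weights, data)) / w_sum
        new_alphas.append(w_sum / n)
        new_mus.append(mu_i)
        new_vars.append(var_i)
    return new_alphas, new_mus, new_vars
```

Iterating `em_step` and monitoring `log_likelihood` should show a sequence of values that never decreases, up to rounding error.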

The EM algorithm is simpler than methods based on Newton's method, and
most of the important theoretical convergence properties of it are given
in Redner and Walker (1984). The most important convergence property of EM
is that

$$L(\Theta_{(r+1)}) \ge L(\Theta_{(r)})$$

for each iteration, so that progress (in this sense) towards finding a
maximizer is always being made. The quasi-Newton methods are general
purpose optimization tools, unlike EM, and are much more complicated. An
excellent discussion of these methods is in Dennis and Schnabel (1983),
and here we only discuss a few of the basic ideas.

Optimization software is generally written to do minimization, but this
poses no problem since minimizing $f(\Theta) = -L(\Theta)$ is equivalent
to maximizing $L(\Theta)$. Newton's method for generating $\Theta_{(r+1)}$
from $\Theta_{(r)}$ can be interpreted by first building the quadratic
model $m(s)$ (of $f(\Theta)$) which is defined by

$$m(s) = f(\Theta_{(r)}) + s^T \nabla f(\Theta_{(r)}) + \left(s^T \nabla^2 f(\Theta_{(r)})\, s\right)/2$$

Note that it is a model in the sense that $m(0) = f(\Theta_{(r)})$,
$\nabla m(0) = \nabla f(\Theta_{(r)})$, and
$\nabla^2 m(0) = \nabla^2 f(\Theta_{(r)})$. Next, assuming that
$\nabla^2 f(\Theta_{(r)})$ is positive definite, the model function is
globally minimized over $s$ to obtain

$$s_{(r)} = -\left[\nabla^2 f(\Theta_{(r)})\right]^{-1} \nabla f(\Theta_{(r)})$$

which is then used to define the next Newton iterate by

$$\Theta_{(r+1)} = \Theta_{(r)} + s_{(r)} \qquad (3)$$

This basic scheme was varied in two different ways in conducting the tests
described in the next section. The variations are clearly noted in
reporting the numerical results, but an explanation of them is given here.
The first modification is in the definition of the model. Finite
difference approximations to the model Hessian $\nabla^2 f(\Theta_{(r)})$
and model gradient $\nabla f(\Theta_{(r)})$ are used. This modification
makes the general software much easier to use without changing the
behavior of the iteration sequence very much. (Other important types of
approximations to the Hessian are studied in Dennis and Schnabel (1983).)
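To make the Newton step with finite-difference derivatives concrete, here is a small self-contained sketch. It is not the general-purpose software used in the tests: the central-difference formulas, step sizes, and helper names (`fd_gradient`, `fd_hessian`, `solve`, `newton_step`) are illustrative assumptions, and no globalization strategy is included.

```python
def fd_gradient(f, theta, h=1e-5):
    """Central-difference approximation to the gradient of f at theta."""
    grad = []
    for i in range(len(theta)):
        up = list(theta)
        dn = list(theta)
        up[i] += h
        dn[i] -= h
        grad.append((f(up) - f(dn)) / (2.0 * h))
    return grad


def fd_hessian(f, theta, h=1e-4):
    """Central-difference approximation to the (symmetric) Hessian."""
    n = len(theta)
    H = [[0.0] * n for _ in range(n)]

    def shifted(i, j, di, dj):
        t = list(theta)
        t[i] += di
        t[j] += dj
        return f(t)

    for i in range(n):
        for j in range(i, n):
            val = (shifted(i, j, h, h) - shifted(i, j, h, -h)
                   - shifted(i, j, -h, h) + shifted(i, j, -h, -h)) / (4.0 * h * h)
            H[i][j] = H[j][i] = val
    return H


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x


def newton_step(f, theta):
    """One iteration of (3): solve H s = -g, then return theta + s."""
    g = fd_gradient(f, theta)
    H = fd_hessian(f, theta)
    s = solve(H, [-gi for gi in g])
    return [t + si for t, si in zip(theta, s)]
```

On a strictly convex quadratic the quadratic model is exact, so a single call to `newton_step` lands on the minimizer up to finite-difference rounding error.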

The second type of modification involves the incorporation of a global
strategy that makes the iteration more reliable. The iteration described
in (3) is known to converge rapidly to the solution, when started close
enough, but it can many times fail when used some distance away from the
solution. Usually, the iteration in (3) is hybridized with a safer
iteration so that (3) is used when close to the solution and the safer
(and usually slower) iteration is used away from the solution. The three
globalization strategies tried in the tests are all described in detail in
Dennis and Schnabel (1983) and are called the line search, double dogleg,
and hook step. The line search only uses the direction given by the Newton
step $s_{(r)}$, while the other two strategies also use the direction
given by the negative gradient $-\nabla f(\Theta_{(r)})$. The performance
of these different global strategies varies according to the type of
problem so that all should be tested when trying to determine the
usefulness of a quasi-Newton approach.

We last mention that a reparameterization is used in applying Newton's
method to the problem of minimizing $-L(\Theta)$. The constraint
$\alpha_1 + \alpha_2 + \cdots + \alpha_m = 1$ is effectively discarded by
keeping $\alpha_1, \ldots, \alpha_{m-1}$, and replacing every occurrence of
$\alpha_m$ by $1 - (\alpha_1 + \alpha_2 + \cdots + \alpha_{m-1})$.


3.  Simulation  Runs 

Simulation tests were performed using several univariate mixtures of two
normal distributions. The choices for the parameters $\alpha$, $\mu$, and
$\sigma^2$ were as follows:

Weights (Alpha)     1      2        Means (Mu)     1      2
{E}                .50    .50       {2}           0.0    2.0
{N}                .20    .80       {4}           0.0    4.0

Variance (VAR)      1      2
{1}                1.0    1.0
{T}                1.0    0.1

The number or letter in { } is the designation used for the parameter
indicated. For example, {E} stands for Alphas of (.5 .5).

All possible combinations of the above parameters were used to generate
sample data for the simulation. To facilitate reporting of the results,
the following scheme was used to name the data files:

F 1 E 2 1 0

UNIVARIATE CASE        1
ALPHA (.5 .5)          E
MU (0 2)               2
VAR (1 1)              1
TRIAL NUMBER (0-9)     0

Table I: Number of Failures


Good Guess
(There were 10 samples of each type used)

Distribution   EM   Linear Search   Dogleg       Hookstep
1E21            0   4 (0) [1] ¹     4 [1]        4 [1]
1E2T           10   0 [1]           0 [1]        0 [1]
1E41            0   0               0            0
1E4T            0   (1) [1]         (1) [1]      (1) [1]
1N21            1   7 [1]           7 [1]        7 [1]
1N2T           10   0 [1]           0 [1]        0 [1]
1N41            0   1 (1) [1]       1 (1) [1]    1 (1) [1]
1N4T            0   0               0            0

Bad Guess

Distribution   EM   Linear Search   Dogleg       Hookstep
1E21            0   (6) [4]         (4) [5]      (5) [5]
1E2T           10   [10]            [10]         [10]
1E41            0   (4) [6]         (2) [7]      (2) [8]
1E4T            0   (3) [7]         (1) [9]      (2) [8]
1N21            1   (1) [9]         (1) [9]      (1) [9]
1N2T           10   (3) [7]         (1) [9]      [10]
1N41            0   2 [5]           1 (3) [6]    (3) [6]
1N4T            0   (1) [7]         (1) [8]      (1) [9]

¹ Note: 4 = bad results; (0) = very bad results; [1] = failure to run.

Bad results means that the point of convergence was reasonable but was not
'close' to the generating parameters. Very bad results means that the
algorithm converged but produced unreasonably large values. Failure to run
means that the algorithm did not converge; therefore, no results were
obtained.

Table II: Mean Computational Requirements


Good Guess

Distribution   EM      Linear Search     Dogleg          Hookstep
1E21           63.6 ¹  24.9 (33.4) ²     24.8 (35.1)     24.4 (42.2)
1E2T           42.2    22.2 (32.8)       22.4 (34.6)     23.2 (40.0)
1E41            8.2    19.7 (28.6)       19.1 (29.9)     20.8 (37.2)
1E4T            6.8    13.8 (25.3)       11.7 (25.4)     20.2 (36.0)
1N21           65.2    29.4 (41.6)       29.4 (42.8)     29.6 (51.9)
1N2T           23.3    26.8 (38.7)       26.4 (40.9)     28.3 (46.0)
1N41           10.4    23.3 (33.6)       24.4 (35.6)     24.0 (41.6)
1N4T            5.0    24.7 (35.3)       25.8 (38.1)     24.4 (42.8)

Bad Guess

Distribution   EM      Linear Search     Dogleg          Hookstep
1E21           94.9    40.2 (60.0)       55.0 (66.6)     151.0 (177.8)
1E2T           42.5    -                 -               -
1E41           17.1    36.2 (52.7)       52.0 (78.5)     151.0 (171.5)
1E4T           12.5    35.0 (51.5)       36.5 (51.5)     35.5 (45.5)
1N21           99.6    26.0 (35.0)       40.0 (53.0)     151.0 (168.0)
1N2T           32.5    45.0 (74.0)       28.0 (38.0)     -
1N41           21.1    27.8 (50.8)       41.5 (68.2)     89.2 (118.7)
1N4T           10.2    35.0 (66.0)       44.5 (70.0)     151.0 (198.0)

Multiplications and divisions for EM = iterations*14*sample size (N)
Exponential evaluations in EM = iterations*2*N
Multiplications and divisions in Quasi-Newton = (function calls)*6*N + (gradient calls)*25*N
Exponential evaluations in Quasi-Newton = (function calls + gradient calls)*2*N
Logarithm evaluations = (function calls)*N

¹ 63.6 Iterations
² 24.9 Function Calls, (33.4) Gradient Calls

Note: Samples were included in the mean only if they ran to completion.

IMSL subroutine GGUBFS was used to generate
a  random  number  between  0  and  1.  This  number  was 
compared  to  ALPHA  1,  and,  if  it  was  smaller  than 
ALPHA  1,  then  distribution  *1  was  selected  to 
produce  the  random  data  item.  If  the  random  number 
was  larger  than  ALPHA2,  then  distrubution  *2  was 
selected.  The  data  items  was  then  produced  by 
generating  a  random  value  on  a  standard  normal 
using  the  IMSL  routine  GGNQF.  This  standard  normal 
value  was  translated  into  an  equivalent  value  from 
the  distribution  selected.  This  process  was 
repeated  to  produce  the  300  data  points  in  each 
sample.  Ten  samples  were  generated  for  each  set  of 
parameters. 
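
The sampling procedure just described can be sketched in a few lines. This is a minimal Python illustration, not the authors' program: Python's random module stands in for the IMSL routines GGUBFS and GGNQF, and the function name sample_mixture and the parameter values in the usage lines are hypothetical.

```python
import random


def sample_mixture(alpha1, mu1, sd1, mu2, sd2, n=300, seed=None):
    """Draw n points from a two-component normal mixture.

    alpha1 is the weight of component 1; the uniform and normal
    generators are stand-ins for IMSL GGUBFS and GGNQF.
    """
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        u = rng.random()          # uniform on [0, 1): selects the component
        z = rng.gauss(0.0, 1.0)   # standard normal deviate
        if u < alpha1:
            data.append(mu1 + sd1 * z)   # translate to component 1
        else:
            data.append(mu2 + sd2 * z)   # translate to component 2
    return data


# Ten samples of 300 points for one (hypothetical) parameter set.
samples = [sample_mixture(0.5, 0.0, 1.0, 4.0, 1.0, seed=k) for k in range(10)]
```

Seeding each sample separately keeps the ten data files reproducible, which matters when the same files are fed to several algorithms.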

All data files were run using EM and Quasi-Newton
analysis with two sets of initial values. These
initial values were as follows:


GUESS 1: Parameters used to create the input file


GUESS 2:
            Component 1                   Component 2
ALPHA       .35                           .65
MU          \mu_1 - (\mu_2 - \mu_1)/4     \mu_2 + (\mu_2 - \mu_1)/4
VAR         9                             9


4.  Conclusions 

Test results from Table I show that the EM
algorithm converges in every case, even from a poor
initial guess, whereas the Quasi-Newton algorithm
gives poor to no results when started from a poor
initial guess. Table I also indicates that, when
started from a good position, the Quasi-Newton
algorithm gives better results in the case of unequal
variances (see distributions 1E2T and 1N2T). This
remained true in the case of an even weight
distribution and an uneven weight distribution.
However, note that most of the inaccuracy in the EM
runs for the case of normals with uneven variances
resulted from a poor convergence to the parameter
associated with the small variance. As was
expected, both algorithms performed well when
there was good separation between the normals;
however, the Quasi-Newton algorithm did
occasionally fail to run. There was no noticeable
difference in levels of convergence among the
three Quasi-Newton strategies.
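
For readers who want to see what one EM run looks like, the following is a minimal Python sketch of the EM iteration for a mixture of two univariate normals, returning the iteration count reported in Table II. It is an illustration under stated assumptions, not the authors' code: the convergence test (largest parameter change below tol), the function names, and the test data are all hypothetical.

```python
import math
import random


def normal_pdf(x, mu, var):
    """Density of a normal with mean mu and variance var."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def em_two_normals(data, alpha, mu1, var1, mu2, var2, tol=1e-6, max_iter=1000):
    """Run EM from the given start; return (parameters, iterations used)."""
    n = len(data)
    for iteration in range(1, max_iter + 1):
        # E-step: posterior probability that each point came from component 1.
        resp = []
        for x in data:
            p1 = alpha * normal_pdf(x, mu1, var1)
            p2 = (1.0 - alpha) * normal_pdf(x, mu2, var2)
            resp.append(p1 / (p1 + p2))
        # M-step: weighted maximum likelihood updates.
        n1 = sum(resp)
        n2 = n - n1
        new = (
            n1 / n,
            sum(r * x for r, x in zip(resp, data)) / n1,
            0.0,
            sum((1.0 - r) * x for r, x in zip(resp, data)) / n2,
            0.0,
        )
        v1 = sum(r * (x - new[1]) ** 2 for r, x in zip(resp, data)) / n1
        v2 = sum((1.0 - r) * (x - new[3]) ** 2 for r, x in zip(resp, data)) / n2
        new = (new[0], new[1], v1, new[3], v2)
        old = (alpha, mu1, var1, mu2, var2)
        alpha, mu1, var1, mu2, var2 = new
        if max(abs(a - b) for a, b in zip(new, old)) < tol:
            return new, iteration
    return (alpha, mu1, var1, mu2, var2), max_iter


# Hypothetical test data: equal weights, means 0 and 4, unit variances.
rng = random.Random(0)
data = [rng.gauss(0.0, 1.0) if rng.random() < 0.5 else rng.gauss(4.0, 1.0)
        for _ in range(300)]
fit, iterations = em_two_normals(data, 0.5, 0.0, 1.0, 4.0, 1.0)
```

Each iteration touches every data point a fixed number of times, which is why the iteration count is a fair proxy for EM's total work.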


The effectiveness of the EM algorithm is
measured by the number of iterations required for
convergence (see Table II). Since the main
computational effort incurred in the Quasi-Newton
strategies is in computing the function values
and/or the gradient values, and since each iteration
might involve several calls to these processes, we
felt that the number of function calls and the
number of gradient calls would be a more
meaningful statistic to use to measure the
effectiveness of the Quasi-Newton strategies. For
the samples run, the work required to achieve
convergence, in terms of the number of multiplications
and divisions, the number of exponential evaluations, and
the number of logarithmic evaluations, is slightly
more for each of the Quasi-Newton strategies than
it is for the EM algorithm (see Table II). The amount
of work for the line search and dogleg strategies
was about the same in each case; the hookstep
approach required a little more work in each case.
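
The work comparison above can be reproduced from the operation-count formulas printed under Table II. The Python sketch below applies them to one row of the table. The coefficients 14, 2, 6, and 25 and the use of N in every term are read from a damaged copy of the footnote and should be treated as assumptions, as should the function names; the inputs are the good-guess entries for distribution 1E21 with N = 300.

```python
def em_work(iterations, n):
    """Operation counts for EM (coefficients as read from the Table II notes)."""
    return {
        "muldiv": iterations * 14 * n,   # multiplications and divisions
        "exp": iterations * 2 * n,       # exponential evaluations
        "log": 0.0,                      # EM is listed with no logarithm count
    }


def quasi_newton_work(function_calls, gradient_calls, n):
    """Operation counts for a Quasi-Newton strategy (same caveat)."""
    return {
        "muldiv": function_calls * 6 * n + gradient_calls * 25 * n,
        "exp": (function_calls + gradient_calls) * 2 * n,
        "log": function_calls * n,
    }


N = 300                                    # data points per sample
em = em_work(63.6, N)                      # 1E21, good guess, EM
line_search = quasi_newton_work(24.9, 33.4, N)   # 1E21, good guess

for name, counts in (("EM", em), ("Linear search", line_search)):
    print(name, {k: round(v) for k, v in counts.items()})
```

Under these assumed coefficients the line search needs somewhat more multiplications and divisions than EM for this row, in line with the statement that the Quasi-Newton work is slightly higher.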

These simulation results suggest that
several directions are worth pursuing. Firstly, the
results from these experiments are consistent with
previous results which indicated that a hybrid
strategy is needed which starts with the EM
approach and then switches to a Newton approach or a
hill-climbing approach when appropriate. Secondly,
a set of experiments similar in nature to these
experiments should be run on multivariate data.


HIGHER ORDER FUNCTIONS IN NUMERICAL PROGRAMMING


David  S.  Gladstein,  ICAD  Inc. 


Introduction 

Many mathematical problems are defined as
combinations of elementary operations on functions,
such as integration, differentiation, and finding
roots. Often, however, numerical programs to
solve such problems are hand crafted for the
particular application, rather than being composed of
functionally independent parts [1]. This is largely
due to the weakness of traditional algebraic
programming languages in manipulating functions.

Languages such as LISP and Scheme [2,3]
treat functions as first class objects, and allow one
to write higher order functions, which have
functions as inputs or outputs. This flexibility in the
manipulation of functions greatly reduces the
distance between the notation of mathematical
formulations and the notation of the corresponding
numerical program.

Example 

A certain problem in sequential analysis deals
with the attained significance S of an observed
outcome (x,n). S(x,n) is computed as a sum of
acceptance probabilities A(x,i), which in turn
depend upon a family of density functions f_i(x).

Let \phi and \Phi denote the standard normal
density and distribution functions, and let a, b, n, and
\theta be parameters. Define

f_1(x) = \phi(x - \theta)  if a \le x \le b;   0, otherwise

f_i(x) = \int_a^b f_{i-1}(y) \phi(x - y - \theta) dy      (i \ge 2)

A(x,1) = 1 - \Phi(x - \theta)

A(x,i) = \int_a^b f_{i-1}(y) (1 - \Phi(x - y - \theta)) dy      (i \ge 2)

S(x,n) = \sum_{i=1}^{n-1} A(b,i) + A(x,n)

Given x and 0 < l < u < 1, it is required to
find a confidence interval [\theta_l, \theta_u] with

S(x,n | \theta = \theta_l) = l
S(x,n | \theta = \theta_u) = u

Routines for \phi, \Phi, root finding, and integration
are required. The first page of listings at the
end of this article gives the Common LISP code,
namely the functions normal-density,
normal-distribution [4], and secant and romberg [5].
secant finds a zero of a function using the
secant method; romberg integrates a function of
one variable using Romberg's method. These
routines were translated into LISP directly from their
sources without regard to their eventual
application; the code reads like C that happens to be
written in LISP.

These routines are sufficient to provide a
correct solution, but a major difficulty remains. The
f_i are expressed as convolution integrals, each
depending upon the previous one until the basis case
f_1. Integrating f_i requires evaluating f_{i-1} at many
points, at each of which f_{i-2} must be integrated
and hence evaluated at many points, and so on.
The number of function evaluations would seem
to be exponential in i, an unacceptable time
complexity. However, it is to be expected that each
function f_i might be evaluated repeatedly at
certain points. If the function values at these points
could be retained, much redundant computation
could be avoided.

The higher order function cacheing is
introduced, mapping functions f(x) to mathematically
equivalent functions g(x). g operates by
looking up its input x in a table [6] of (x, f(x)) pairs. If
the required value is found, it is returned. If not,
f is used to compute the value, the table is
updated, and the computed value is returned. This
method has the desirable property that no prior
knowledge of the pattern of repeated evaluations
is necessary, making it especially attractive for
dynamic programming problems.
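
The table lookup described above is easy to see in isolation. The following Python sketch mirrors the behavior of cacheing for a function of one argument; it is an illustration in a different language from the article's LISP, and the helper slow_square and its call counter are hypothetical.

```python
def cacheing(f):
    """Return a function equivalent to f that remembers computed values."""
    table = {}                      # maps x to f(x)

    def g(x):
        if x not in table:          # value not seen before: compute and store
            table[x] = f(x)
        return table[x]             # value found (or just stored): return it

    return g


calls = {"count": 0}


def slow_square(x):
    """Stand-in for an expensive function; counts how often it really runs."""
    calls["count"] += 1
    return x * x


fast_square = cacheing(slow_square)
values = [fast_square(3.0), fast_square(3.0), fast_square(4.0)]
```

After these three calls slow_square has run only twice, because the second request for 3.0 was answered from the table. Unlike a lookup written with a logical "or", this version also caches results that happen to be zero.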

Examination of the definition of A(x,i)
reveals that it depends upon a, b, and \theta, so in
some sense it is a function of five variables, not
two. On the other hand, in the context in which
it is used, namely in the definition of S(x,n), a,
b, and \theta are fixed and only x and i vary. This
conflict is resolved by considering the definition
A(x,i) = ... to be the definition of a higher order
function \mathcal{A}(a,b,\theta) which maps values of a, b, and \theta
to some particular function A(x,i). A subsequent
reference to A(x,i) is then understood to indicate
such a function A, rather than the functional \mathcal{A}.

The LISP code for the application appears
on the last page of this article. The function
acceptance-probability-function corresponds
to \mathcal{A}, mapping a, b, and \theta to a function of x and i.
In addition, it takes n as an input, to determine
how many of the functions f_i need be constructed.

Before A(x,i) can be constructed, \mathcal{A} must
first build up the array of functions f_i. Each
f_i after the first involves the construction of
three functions at run time: first the integrand
f_{i-1}(y) \phi(x - y - \theta) is created programmatically,
it is integrated from a to b with respect to y,
leaving x as a parameter, and finally the integral is
mapped onto a cacheing version to prevent
redundant integration. With the f_i in place, A(x,i) is
simply constructed as the integral of a
programmatically created function which involves f_{i-1}.
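
The construction just described can be sketched outside LISP as well. The Python version below builds the densities f_i as cached closures and returns A(x,i) as a closure over a, b, and theta. It is an illustration under assumptions, not a port of the article's listing: a fixed-panel Simpson rule (the hypothetical helper simpson) replaces Romberg integration, and math.erf replaces the polynomial approximation to the normal distribution function.

```python
import math


def normal_density(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def normal_distribution(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def simpson(f, a, b, panels=200):
    """Composite Simpson rule on a fixed grid (stand-in for romberg)."""
    h = (b - a) / panels
    total = f(a) + f(b)
    for k in range(1, panels):
        total += (4.0 if k % 2 else 2.0) * f(a + k * h)
    return total * h / 3.0


def cacheing(f):
    table = {}

    def g(x):
        if x not in table:
            table[x] = f(x)
        return table[x]

    return g


def acceptance_probability_function(n, a, b, theta):
    """Map (n, a, b, theta) to a function A(x, i), building f_1 .. f_{n-1}."""
    densities = [None] * (n + 1)
    densities[1] = lambda x: normal_density(x - theta) if a <= x <= b else 0.0
    for i in range(2, n):
        def f_i(x, prev=densities[i - 1]):      # bind the previous density now
            return simpson(lambda y: prev(y) * normal_density(x - y - theta), a, b)
        densities[i] = cacheing(f_i)            # avoid redundant integration

    def A(x, i):
        if i == 1:
            return 1.0 - normal_distribution(x - theta)
        prev = densities[i - 1]
        return simpson(
            lambda y: prev(y) * (1.0 - normal_distribution(x - y - theta)), a, b)

    return A


A = acceptance_probability_function(3, -8.0, 8.0, 0.0)
```

Because the integration grid is fixed, every f_i is requested at the same abscissas again and again, so the cache turns the apparently exponential recursion into a modest amount of work.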

The function S(x,n) is simple enough that
the LISP function significance is barely higher
order, taking a function A(x,i), and x, n, and b
as inputs. While the argument for treating the
definition S(x,n) = ... as the definition of a higher
order function \mathcal{S}(A(x,i), b) could be applied, there
is little to be gained by doing so.

The function confidence-interval takes
values for x, n, a, b, l, and u, and finds \theta_l and \theta_u
so that S(x,n | \theta = \theta_l) = l and S(x,n | \theta = \theta_u) = u.
The similarity of the two subproblems leads to the
introduction of the internal function get-theta,
which finds the \theta corresponding to a particular
value of p. Since the secant method only finds
zeros of functions, get-theta constructs the
appropriate function programmatically.
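
The trick used by get-theta, turning "find theta with g(theta) = p" into "find a zero of g(theta) - p", can be shown with a small Python sketch. The secant routine follows the same iteration as the article's listing; the wrapper name solve_for_level and the cubic test function are hypothetical.

```python
def secant(f, p0, p1, epsilon=1e-8, imax=20):
    """Find a zero of f by the secant method, starting from p0 and p1."""
    q0, q1 = f(p0), f(p1)
    for _ in range(imax):
        p = p1 - q1 * (p1 - p0) / (q1 - q0)
        if abs(p - p1) <= epsilon:
            return p                      # successive iterates agree: done
        p0, p1 = p1, p
        q0, q1 = q1, f(p)
    raise RuntimeError("Iteration count exceeded.")


def solve_for_level(g, level, guess0, guess1):
    """Find t with g(t) = level by handing secant the shifted function."""
    return secant(lambda t: g(t) - level, guess0, guess1)


# Hypothetical example: the t for which t**3 equals 8.
root = solve_for_level(lambda t: t ** 3, 8.0, 1.0, 3.0)
```

The shifted function is created at run time and never needs a name, which is the same convenience the LISP version gets from lambda.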

Discussion 

A solution to a non-trivial numerical
computing problem has been developed, where the
constituents of the LISP code of the solution were
transliterations of either standard numerical
techniques or statements of the problem. The original
solution to this problem was a C language program
of about five hundred lines, which took
approximately one week to write and debug. The LISP
version was written in one half day, and consists
(using normal indenting) of less than one hundred
lines of code, more than half of which is reusable.


The major cost of software is the time to write
and debug it. The more compatible the language
of the problem statement and the language of the
implementation, the less work in translating from
one to the other, and the less opportunity for
errors and confusion. Since LISP caters to the
creation and manipulation of functions, and
mathematical problems are often posed in terms of the
definition and use of functions, the use of LISP for
numerical work is natural.


(defvar 1/rad2pi (/ (sqrt (* 2 pi))))

(defun normal-density (x)
  (* (exp (* x x -0.5d0)) 1/rad2pi))

(defvar normal-coefficients
  '(0.0498673470d0 0.0211410061d0 0.0032776263d0
    0.0000380036d0 0.0000488906d0 0.0000053830d0))

(defun normal-distribution (x)
  (if (< x 0)
      (- 1 (normal-distribution (- x)))
      (- 1 (* 0.5
              (expt (1+ (let ((xpow x)
                              (sum 0))
                          (dolist (coeff normal-coefficients  ; do for each coeff in normal-coefficients,
                                   sum)                       ; when done return sum
                            (setq sum (+ sum (* xpow coeff)))
                            (setq xpow (* xpow x)))))
                    -16)))))

(defun secant (f p0 p1 &optional (epsilon 1d-8) (imax 20))
  (let ((q0 (funcall f p0))
        (q1 (funcall f p1))
        p)
    (dotimes (i imax                                ; 0 <= i < imax
              (error "Iteration count exceeded."))  ; error if loop runs out
      (setq p (- p1 (* q1 (/ (- p1 p0) (- q1 q0)))))
      (when (<= (abs (- p p1)) epsilon)
        (return p))                                 ; return root
      (setq p0 p1)
      (setq p1 p)
      (setq q0 q1)
      (setq q1 (funcall f p)))))

(defvar romberg-size 35)

(defvar expt4 (let ((arr (make-array romberg-size)))
                (dotimes (i romberg-size   ; 0 <= i < romberg-size,
                          arr)             ; when done, return arr
                  (setf (aref arr i) (float (expt 4 (1- i)) 1.0d0)))))

(defun romberg (f a b &optional (epsilon 1d-8))
  (let ((r1 (make-array romberg-size))
        (r2 (make-array romberg-size))
        (h (- b a)))
    (setf (aref r2 1)
          (* h (+ (funcall f a) (funcall f b)) .5))
    (do ((i 2 (1+ i)))                     ; 2 <= i,
        (nil)                              ; do forever (until a return)
      (let ((temp r1))
        (setf r1 r2)
        (setf r2 temp))
      (setf (aref r2 1)
            (* .5 (+ (aref r1 1)
                     (* h (do* ((kmax (expt 2 (- i 2)))  ; upper bound
                                (k 1 (1+ k))             ; 1 <= k
                                (sum 0))                 ; sum accumulates
                               ((> k kmax) sum)          ; k <= kmax, when done return sum
                            (setq sum (+ sum (funcall f (+ a (* (- k .5) h))))))))))
      (do ((j 2 (1+ j)))                   ; 2 <= j
          ((> j i))                        ; j <= i
        (setf (aref r2 j)
              (/ (- (* (aref expt4 j) (aref r2 (1- j)))
                    (aref r1 (1- j)))
                 (1- (aref expt4 j)))))
      (setf h (/ h 2))
      (when (and (>= i 3)
                 (<= (abs (- (aref r2 (1- i)) (aref r2 i))) epsilon)
                 (<= (abs (- (aref r1 (1- i)) (aref r1 (- i 2)))) epsilon))
        (return (aref r2 i))))))           ; return integral

(defun cacheing (f)
  (let ((cache (make-hash-table :test #'equal)))
    (function
     (lambda (x)
       (or (gethash x cache)
           (setf (gethash x cache) (funcall f x)))))))


(defun acceptance-probability-function (n a b theta)
  (let ((densities (make-array (1+ n))))
    (setf (aref densities 1)
          (function
           (lambda (x)
             (if (<= a x b)
                 (normal-density (- x theta))
                 0))))
    (do ((i 2 (1+ i)))                     ; 2 <= i
        ((>= i n))                         ; i < n
      (let ((previous-density (aref densities (1- i))))
        (setf (aref densities i)
              (cacheing
               (function
                (lambda (x)
                  (romberg
                   (function
                    (lambda (y)
                      (* (funcall previous-density y)
                         (normal-density (- x y theta)))))
                   a b)))))))
    (function
     (lambda (x i)
       (if (= i 1)
           (- 1 (normal-distribution (- x theta)))
           (romberg
            (function
             (lambda (y)
               (* (funcall (aref densities (1- i)) y)
                  (- 1 (normal-distribution (- x y theta))))))
            a b))))))


(defun significance (acceptance-function x n b)
  (+ (do ((i 1 (1+ i))                     ; 1 <= i
          (sum 0))                         ; sum accumulates
         ((>= i n) sum)                    ; i < n, when done return sum
       (setq sum (+ sum (funcall acceptance-function b i))))
     (funcall acceptance-function x n)))


(defun confidence-interval (&key x n a b
                                 (lower 0.05) (upper 0.95)
                                 tl0 tl1 tu0 tu1)
  (labels ((get-theta (p-value guess0 guess1)
             (secant
              (function
               (lambda (theta)
                 (- (significance
                     (acceptance-probability-function
                      n a b theta)
                     x n b)
                    p-value)))
              guess0 guess1)))
    (let* ((mean (/ x n))
           (rad.mt (sqrt n))
           (tl0 (or tl0 (- mean (/ 1.75 rad.mt))))
           (tl1 (or tl1 (- mean (/ 2.75 rad.mt))))
           (tu0 (or tu0 (+ mean (/ 1.00 rad.mt))))
           (tu1 (or tu1 (+ mean (/ 1.75 rad.mt)))))
      (values (get-theta lower tl0 tl1)
              (get-theta upper tu0 tu1)))))


THEORY OF QUADRATURE IN APPLIED PROBABILITY: A Fast Algorithmic Approach


Allen  Don,  Long  Island  University 


ABSTRACT 

The  integral  representation  of  the  moments 
of  a  useful  class  of  probability  density 
functions  is  cast  in  a  canonical  form  in  terms 
of Gauss-Laguerre quadrature. This transforms
the  continuous  integration  into  a  sum  of 
discrete  terms,  effectively  removing  the 
integral  sign  and  exposing  the  parameters  to 
numerical  investigations.  This  allows  moments 
from  data  to  be  related  to  the  unknown 
parameters  via  a  system  of  non-linear  equations. 
This  system  is  easily  and  quickly  solved  for  the 
unknown parameters by any of the numerous
non-linear equation algorithms available for
personal  computers  and  main-frames.  In  addition, 
the  factorials  and  gamma  functions  found  in 
closed  form  theoretical  moment  expressions  and 
in  density  functions  are  discretized  in  the  same 
manner,  enabling  unknown  parameters  within  the 
arguments  of  the  gamma  to  be  included  in 
numerical  searches.  A  dominant  ratios  method  is 
introduced  for  determining  initial  conditions 
for  the  system  of  non-linear  equations  to 
overcome  the  notable  lack  of  convergence  found 
in  non-linear  system  algorithms  when  initial 
conditions  are  not  well-chosen.  The  theory  is 
connected  to  reliability  problems  to  show  a  fast 
algorithmic  approach  rather  than  the  usual 
graphical  approach  to  parameter  identification 
of  density  functions  both  for  truncated  and  for 
full  data. 

1.0  INTRODUCTION 
1.1 The Problem

Determination of parameters for probability
distributions from data is hindered by the
analytical expressions for the moments being
under the integral sign,

(1.1)  M_n = \int_0^\infty x^n f(x) dx

for full moments, and

(1.2)  M_{ns} = \int_0^T x^n f(x) dx

for truncated moments. Otherwise, it might be
possible to have a fitting scheme by a system of
equations representing the moments on one side
of the system, and the data representing the
moments on the other side. In addition, the
density function f(x), in a number of useful
distributions, involves the gamma function
\Gamma(n,m,b) or factorial in which the parameters
and moment designation are arguments within the
gamma sign, hence are intractable for use as
variables. Thus, the inverse problem, that of
obtaining the parameters of a distribution,
given the moments calculated from data, is
rendered difficult. This is typified by the
Weibull distribution, which has both problems. As
in all continuous distributions, the moment
expression is under the integral sign; in
addition, the closed form solution for full
moments contains the gamma function with the
shape parameter as an argument within the gamma
sign. For the Weibull, the n-th moment about the
origin is

(1.3)  M_n = \int_0^\infty a b x^{b-1} e^{-a x^b} x^n dx = \frac{1}{a^{n/b}} \Gamma(1 + n/b)

The gamma density function has a slightly
different problem: the parameters are both under
the integral and within the gamma function
argument. Note that \Gamma(m+1) = m! when the
argument is integral.

(1.4)  M_n = \int_0^\infty \frac{a^{m+1} x^m e^{-ax}}{\Gamma(m+1)} x^n dx = \frac{\Gamma(m+n+1)}{a^n \Gamma(m+1)}

where order = m+1, and M_n represents the n-th
moment about the origin. Clearly, the full
moment expression has a closed form solution
which is easily solved by a system of nonlinear
equations.

The subtlety in using this expression is in
the method of choosing the "extra" value of a
in setting up the system of non-linear
equations. As an example, with three moments,
two values of a and one value of m can be
obtained, but a third value of a is required
in the non-linear formulation. The three
equations use the geometric mean of a_1 and a_2,

(1.9)  a_{gm} = (a_1 a_2)^{1/2} .

While the closed form solution to the
gamma's full moments is simple enough, as seen
above, it will be seen later that the truncated
moments have the "order" parameter within the
argument of the gamma function, and are therefore
susceptible to the same approach as the Weibull.

For  pedagogic  reasons,  the  gamma 
distribution  and  Erlang  distribution  will  appear 
herein  to  be  identical  except  that  the  order 
parameter  of  the  Erlang  will  be  understood  to  be 
integer  whereas  the  equivalent  for  the  gamma  is 
understood  to  be  real.  Occasionally,  the  Erlang 
nomenclature  might  be  used  when,  in  fact,  the 
search  for  the  order  parameter  yields  a  real 
number  for  the  best  fit. 

A common approach to determination of
parameters of distributions in reliability is to
use judgement in selecting the model to which
data is to be fitted, then use probability paper
for that model. A straight line on the
probability paper indicates that the correct
choice has been made. Computer power, with an
engineering work-station becoming commonplace at
the engineer's fingertips, makes it
appropriate to implement probability research of
all kinds in a more automated manner.

2.0  BASIC  THEORY 

2.1  General  Principles 

Quadrature effectively removes the integral
sign and exposes the moment expression and its
parameters to numerical search methods,

(2.1)  M_n = \int_0^\infty x^n f(x,m,a,b) dx = \sum_{i=1}^{p} C_i f(x_i,m,a,b) x_i^n

where C_i and x_i are Christoffel weights and
knots for Gauss-Laguerre quadrature [1, p.105],
and p is the number of points (weights and
knots).

The  arguments  under  the  gamma  symbol  are 
"uncovered"  in  a  like  manner  since  the  gamma  and 
factorial  functions  are,  in  fact,  derived  from 
integral  expressions  as  in  (2.4)  below. 

A system of non-linear equations is set up,

(2.2)  \sum_{i=1}^{p} C_i f(x_i,m,a,b) x_i^n = M_n ,   n = 1, 2, ..., k

where k is the number of equations, hence also
the number of unknowns being sought. Therefore,
k data moments must be used.

The right hand side of the system is the
data (moments calculated from data). The left
hand side of the system is the quadrature
representation of the integral moment
expressions.

Moments about the origin are used. M_0 is
the zeroth moment representing the cumulative
distribution function. Hence, for truncated data
applied to reliability theory, M_0 represents the
fraction of units failed.

Gauss-Laguerre quadrature transforms the
continuous integration of functions of the
kernel e^{-t} to a summation of discrete values,

(2.3)  \int_0^\infty e^{-t} t^n dt = \sum_{i=1}^{p} C_i t_i^n ,

which is exact for n \le 2p-1, when n is integer.

Therefore, the gamma function is related to
Gauss-Laguerre quadrature by

(2.4)  \Gamma(n) = \int_0^\infty e^{-t} t^{n-1} dt = \sum_{i=1}^{p} C_i t_i^{n-1} = (n-1)!

where n can be real or integer.

If  n  were  unknown  and  had  the  numerical 
values  of  the  right-hand-side  of  the  above  been 
given,  then,  using  the  Gauss-Laguerre  table  for 
the  weights  (Christoffel  numbers)  and  knots 
(points  on  the  time  axis)  and  using  a  system  of 
non-linear  equations  solved  by  a  suitable 
algorithm,  the  value  of  n,  the  argument  within 
the  gamma  sign,  would  be  obtained. 
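
Both statements above, exactness for integer powers up to 2p-1 and recovery of an unknown gamma argument, can be checked by hand with the two-point rule. The Python sketch below uses the standard closed-form two-point Gauss-Laguerre knots 2 -/+ sqrt(2) and weights (2 +/- sqrt(2))/4; the bisection search and the function names are assumptions made for illustration, standing in for the general non-linear equation solver the text has in mind.

```python
import math

SQRT2 = math.sqrt(2.0)
KNOTS = (2.0 - SQRT2, 2.0 + SQRT2)                     # t_1, t_2 for p = 2
WEIGHTS = ((2.0 + SQRT2) / 4.0, (2.0 - SQRT2) / 4.0)   # C_1, C_2 for p = 2


def gamma_by_quadrature(n):
    """Quadrature form of (2.4): sum of C_i * t_i**(n - 1)."""
    return sum(c * t ** (n - 1.0) for c, t in zip(WEIGHTS, KNOTS))


def solve_gamma_argument(value, lo=2.0, hi=4.0, tol=1e-12):
    """Bisection for n in [lo, hi] with gamma_by_quadrature(n) = value.

    Assumes the quadrature sum is increasing on the bracket, which holds
    for the two-point rule on [2, 4].
    """
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if gamma_by_quadrature(mid) < value:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Exact for t**0 .. t**3 (n - 1 <= 2p - 1 = 3), no longer exact for t**4.
exact_cases = [(n, gamma_by_quadrature(n), math.factorial(n - 1)) for n in range(1, 5)]
beyond = gamma_by_quadrature(5)        # true value is 4! = 24

# Inverse problem: which n gives a gamma value of 2?
recovered_n = solve_gamma_argument(2.0)
```

With only two points the rule already reproduces 0!, 1!, 2!, and 3!, and the search returns the argument 3 when handed the value 2, which is the inverse use of (2.4) described above.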

A  useful  class  of  probability  density 
functions, the gamma, Erlang, Weibull, Rayleigh,
and  chi-square,  can  be  cast  in  a  canonical  form 
in  terms  of  Gauss-Laguerre  quadrature  weights 


(Christoffel  numbers)  and  knots.  In  addition, 
the  limitation  of  the  usual  (0,oo)  interval  is 
removed  providing  an  opportunity  to  apply  this 
method  to  other  limits  of  integration,  hence,  to 
truncated  moments  and  to  probability  table 
generation.  Further,  the  relationship  between 
the  normal  and  chi-square  is  exploited  to 
generate  real-time  probability  tables  for  the 
normal  and  for  sums  (convolution)  of  normals. 


2.2  The  Stepping  Up  Concept 

While the familiar parameter of the Erlang
form of the gamma function is integer, searches
will pass through and most likely result in a
fractional or real number argument. Also, the
unknown gamma arguments being sought are already
fractional for other density functions. Accuracy
impairment resulting from non-integral arguments
and fractional arguments is remedied by
increasing the argument in unit steps while
simultaneously externally multiplying by a
compensating factor related to the step
increases as in (2.5) below. The gamma function
identities for integers and for stepping-up the
argument are well known and found in the most
abbreviated mathematical tables. The concept of
using these identities or similar identities
either in integral or in non-integral arguments
as a method of increasing accuracy when used
with quadrature methods is not well known, if
at all. The well-known identities are:

(2.5)  \Gamma(n) = \frac{\Gamma(n+1)}{n} = \frac{\Gamma(n+2)}{n(n+1)} = \frac{\Gamma(n+3)}{n(n+1)(n+2)}

The derivation of the gamma function
identities for integers can be obtained from
repeated integration-by-parts [2, pp. 201-203].
The derivation for non-integral gamma function
identities is achieved by the same method. As
can be seen from the following integration-by-parts
in (2.6), there is no real or integer
restriction on n. This observation is
necessary as a precursor to an important
subtlety: ill-conditioning occurs when fractional
powers are encountered, resulting in the
algorithm wandering without ever converging to a
solution. Hence, the gamma relationships of
(2.5) must be used to place the powers to which
the knots are raised within a range in which the
quadrature method will work; this is
demonstrated by (2.10).
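
The effect of stepping up can be examined numerically for \Gamma(0.5) = \sqrt{\pi}. The Python sketch below is an illustration under assumptions, not the author's program: instead of reading a printed Gauss-Laguerre table it computes the knots as roots of the Laguerre polynomial L_p by a grid scan with bisection, and the weights from the standard formula C_i = t_i / ((p+1)^2 L_{p+1}(t_i)^2). All function names are hypothetical.

```python
import math


def laguerre(p, x):
    """Laguerre polynomial L_p(x) by the three-term recurrence."""
    prev, cur = 1.0, 1.0 - x
    if p == 0:
        return prev
    for k in range(1, p):
        prev, cur = cur, ((2 * k + 1 - x) * cur - k * prev) / (k + 1)
    return cur


def gauss_laguerre(p, grid=20000):
    """Knots and weights of the p-point Gauss-Laguerre rule."""
    upper = 4.0 * p + 2.0                 # all roots of L_p lie below this
    step = upper / grid
    knots = []
    left, f_left = 0.0, laguerre(p, 0.0)
    for k in range(1, grid + 1):
        right = k * step
        f_right = laguerre(p, right)
        if f_left * f_right < 0.0:        # sign change: bisect for the root
            lo, hi, f_lo = left, right, f_left
            for _ in range(80):
                mid = 0.5 * (lo + hi)
                f_mid = laguerre(p, mid)
                if f_lo * f_mid <= 0.0:
                    hi = mid
                else:
                    lo, f_lo = mid, f_mid
            knots.append(0.5 * (lo + hi))
        left, f_left = right, f_right
    weights = [t / ((p + 1) ** 2 * laguerre(p + 1, t) ** 2) for t in knots]
    return knots, weights


def gamma_half_stepped(p, s):
    """Gamma(0.5) from s step-ups: Gamma(0.5 + s) / (0.5 * 1.5 * ...)."""
    knots, weights = gauss_laguerre(p)
    divisor = 1.0
    for j in range(s):
        divisor *= 0.5 + j
    total = sum(c * t ** (s - 0.5) for c, t in zip(weights, knots))
    return total / divisor


TRUE_VALUE = math.sqrt(math.pi)
errors = {s: abs(gamma_half_stepped(10, s) - TRUE_VALUE) for s in range(4)}
for s, err in errors.items():
    print("step-ups:", s, "absolute error:", err)
```

Printing the errors for a few values of s and p gives a direct feel for how the number of step-ups and the number of points trade off against each other.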

The usual integration-by-parts derivation
seen in texts is in a stepping-down mode.

(2.6)  \int_0^\infty t^n e^{-t} dt = \left[ -t^n e^{-t} \right]_0^\infty + n \int_0^\infty t^{n-1} e^{-t} dt .

While the first term on the right above,

\left[ -t^n e^{-t} \right]_0^\infty ,

vanishes on an interval of (0,\infty), it becomes
the important term later in truncated
distributions and truncated moments. The
stepping-up of (2.5) is a rearrangement of the
stepping-down of (2.6).

The important subtlety with respect to
fractional powers can be demonstrated by
examining the integral and related quadrature
equivalent for n = 1/2 = 0.5.

(2.7)  \Gamma(0.5) = \sum_{i=1}^{p} C_i t_i^{-0.5}

Observe that, without a step-up, the power of t_i
is negative. Even with a step-up of one, we
find,

(2.8)  \Gamma(0.5) = \frac{\Gamma(1.5)}{0.5} = \frac{1}{0.5} \sum_{i=1}^{p} C_i t_i^{0.5}

which yields a fractional power of t_i, albeit
positive. Christoffel numbers and knots are
derived from a positive integer formulation,

(2.9)  \int_0^\infty t^n e^{-t} dt = \sum_{i=1}^{p} C_i t_i^n ,   n = integer: 0, 1, 2, ...

Hence, quadrature is not valid for n<0 even
though the gamma function itself is valid. Thus,
for a fractional gamma argument, the stepping-up
procedure must be initiated from the beginning
simply to bring the argument within the range of
validity for Gauss-Laguerre quadrature, to wit:

(2.10)  \Gamma(0.5) = \frac{\Gamma(1.5)}{0.5} = \frac{\Gamma(2.5)}{(0.5)(1.5)} = \frac{\Gamma(3.5)}{(0.5)(1.5)(2.5)} = \frac{1}{(0.5)(1.5)(2.5)} \sum_{i=1}^{p} C_i t_i^{2.5}

While it appears that the term t_i^{0.5} of (2.8)
has been brought within a range of validity, the
errors are quite large. A combination of
additional step-ups and an increased number of
points (weights and knots) must be used. If the
density function contains an m-th order
polynomial, and the number of step-ups equals s
and we are dealing with the n-th moment,

(2.11)  \int_0^\infty e^{-t} f(t^m) t^n dt ,

then m+n+s \le 2p-1; hence,

(2.12)  p \ge \frac{m+n+s+1}{2} .

Therefore, p is the least number of points
that must be used. However, for improved
accuracy, it is always wise, and simply
accomplished, to go beyond this minimum number
of points.

2.3 Application to Density Functions

Introducing the parameter a into the
kernel e^{-at} and using a change of variable x=at,
the integration provides the following
quadrature expression,

(2.13)  \int_0^\infty e^{-at} t^n dt = \int_0^\infty e^{-x} \left(\frac{x}{a}\right)^n \frac{dx}{a} = \sum_{i=1}^{p} \frac{C_i}{a} \left(\frac{t_i}{a}\right)^n ,

so that the weights and knots can be used
directly from the Gauss-Laguerre tables, divided
by the parameter a.

Further, in the entire family of
exponentially related density functions, the
parameter a by which C_i is divided, i.e.,
C_i/a, is cancelled; hence, the weights are
precisely those from the table. This can be seen
by using (2.13) together with the gamma
formulation of (1.4),

(2.14)  M_n = \int_0^\infty \frac{a^{m+1} x^m e^{-ax}}{m!} x^n dx
            = \frac{a^{m+1}}{m!} \sum_{i=1}^{p} \frac{C_i}{a} \left(\frac{t_i}{a}\right)^{m+n}
            = \frac{1}{m!} \sum_{i=1}^{p} C_i t_i^m \left(\frac{t_i}{a}\right)^n .

2.4 Change of Variable leading to Canonical Form

The Weibull distribution becomes tractable
by two changes of variables: u=t^b and x=au,

(2.17)  f(t) dt = a b t^{b-1} e^{-a t^b} dt = a e^{-au} du = e^{-x} dx .

(2.18)  M_n = \int_0^\infty t^n a b t^{b-1} e^{-a t^b} dt
            = \int_0^\infty u^{n/b} a e^{-au} du
            = \int_0^\infty \left(\frac{x}{a}\right)^{n/b} e^{-x} dx

For convenience in manipulation and programming,
it is useful to use the reciprocal of b,
i.e., r = 1/b, so that

(2.19)  M_n = \int_0^\infty \left(\frac{x}{a}\right)^{nr} e^{-x} dx = \sum_{i=1}^{p} C_i \left(\frac{t_i}{a}\right)^{nr}

Clearly, when r=1, i.e., b=1, this is the
exponential distribution (Erlang order=1). So
the relationships are sufficiently similar to
give an indication that a canonical form is
possible.

The Erlang (gamma) distribution requires
only the simple change of variable x=at to
change from the integral form to the quadrature
form, so that

(2.20)  M_n = \int_0^\infty \frac{a^{m+1} x^m e^{-ax}}{\Gamma(m+1)} x^n dx = \frac{1}{\Gamma(m+1)} \sum_{i=1}^{p} C_i t_i^m \left(\frac{t_i}{a}\right)^n

where Erlang order = m+1. The difficult problem
of handling the gamma function (factorial) in
the denominator of (2.20) in a numerical search
is overcome by the manner in which the
"uncovered" gamma function is introduced as part
of the numerical procedure. The "uncovered"
gamma function, i.e., the gamma function in
quadrature form, is used as a multiplier in
(2.21) for the data moments rather than as a
divisor for the quadrature moment expression as
in (2.20).

This multiplier to the data, hereinafter
called Modification factor and abbreviated MOD, is
nothing other than the factorial or gamma
function, so that

(2.21)  \sum_{i=1}^{p} C_i t_i^m \left(\frac{t_i}{a}\right)^n = M_n \Gamma(m+1) = M_n \cdot MOD ;

the MOD is the gamma function in quadrature form

(2.22)  MOD = \sum_{i=1}^{p} C_i t_i^m = \Gamma(m+1) = m! ,

or its stepped-up equivalents as in (2.5) and
applied in the same manner as in (2.10) and
(2.23).

Thus, an Erlang of order m would appear
in a system of non-linear equations together
with the gamma function as

(2.23)  \sum_{i=1}^{p} C_i t_i^m \left(\frac{t_i}{a}\right)^n = \frac{M_n}{m+1} \sum_{i=1}^{q} C_i t_i^{m+1}


parameter  r  is  unknown;  n  is  known  because  it 
is  the  particular  number  of  the  moment 
specified.  In  (2.22),  m  is  the  unknown  order 
parameter. 

3.0  CANONICAL  FORMS  AND  NOTATION

3.1  Full  Moment  Canonical  Form

The following quadrature canonical forms are presented for the exponential family of density functions including the Rayleigh. The identical approach can be used for the chi-square, hence for the normal as a by-product.

In  addition,  the  full  form  is  shown 
separately  from  the  finite  interval  form;  again, 
these  could  be  combined  but  are  shown  and 
discussed  separately  for  clarity. 

The truncated model removes the limitation of the usual (0,∞) interval for Gauss-Laguerre quadrature by providing a method for the tails (T,∞) and intervals (T1,T2) and (0,T).

In  addition,  the  subtlety  pointed  out 
earlier  regarding  accuracy  when  fractional 
powers  are  encountered  is  extended  to  the 
canonical  form  with  its  plethora  of  parameters, 
whereas,  in  the  earlier  sections,  the 
application  was  to  the  simple  gamma  function. 

The Full Moment Canonical Form is

(3.1) 


=  n_ 


m+l 


I 

1=1 


Citi 


ro+1 


In  the  above  expression,  the  summation  indices 
are  p  and  q  on  the  left  and  right  sides, 
respectively,  to  indicate  that  the  number  of 
quadrature  points  used  on  the  left  side  and  on 
the  right  side  do  not  necessarily  have  to  agree. 
In  addition,  on  the  right  side,  the  gamma 
function  is  shown  with  a  step-up  of  1. 

Additional  step-ups,  more  points,  or  both,  will 
improve  accuracy. 


2.5  Alternate  Weibull  Form

Recall that the right hand side of (1.3) showed the closed form solution to the Weibull moments which, with r=1/b, will appear as

(2.24)  $M_n = \left(\frac{1}{a}\right)^{nr} \Gamma(1+nr)$ ,

and, since r is real, so is the product nr.

Using the MOD, which is as valid for real arguments as it is for integer arguments, the following expression becomes available for use in a system of non-linear equations.

(2.25)  $\left(\frac{1}{a}\right)^{nr} = M_n / \mathrm{MOD}$,

where

(2.26)  $\mathrm{MOD} = \sum_{i=1}^{p} C_i t_i^{nr} = \frac{1}{1+nr} \sum_{i=1}^{p} C_i t_i^{1+nr} = \Gamma(1+nr)$

with a single step-up as above or with a higher order step-up.

This form uses quadrature to uncover the argument in the gamma function Γ(1+nr) in a manner almost identical to Γ(m+1) of (2.22). The product nr is unknown since the shape

0+m 

M  »  ^  —  - 

n“  ml (ra+nr+1) (m+nr+2) . . (m+nr+Q) 


3.2  Finite  Interval  Canonical  Form 

The following, which at first seems to be limited to the interval (T,∞), will be found to be applicable to (0,T), as will be shown hence.
(3.2)


M 


nt 


gO+m  ,jm+nr+Q 

( m+nr+ 1 ) ( m+nr+2 ) . . ( m+nr+Q ) 


,Q+m 


m+nr+Q 


(m+nr+1) (m+nr+2) . . (m+nr+Q) 


i=l 

3.2.1  Notation

The notation used is as follows:
Nomenclature:

M_n       Full Moment about origin
M_0       CDF
M_ns      Truncated Moment (0,T)
M_nt      Tail Moment (T,∞)
m         Erlang order minus one, i.e. order = m+1
          m=0 for exponential (Erlang order=1)
          m=0 for Weibull
a         Parameter related to time constant
r         Inverse of Weibull shape parameter b, r=1/b
          r=1 for Exponential or Erlang
          r=.5 for Rayleigh
p         Number of quadrature points
C_i, t_i  Christoffel numbers and knots
Q         Step-up index
          Q is integer, limited to Q < 2p-nr-1-m
k         Order of step-up (Q=k)
X         Real-time
          T=X for gamma, exponential, and Erlang
          T=X^b for Weibull
T         Truncation point
T1, T2    Finite Interval


3.3  Finite  Interval  Quadrature  -  Derivation 

In  a  manner  similar  to  the  two  changes  of 
variable  used  to  obtain  the  full  moment 
quadrature  form,  the  finite  interval  quadrature 
requires  two  as  well,  but  the  second  change  of 
variable  introduces  the  finite  time  "T". 

The terminology will be consistent with the canonical form in that

(3.3)  $M_{ns} + M_{nt} = M_n$ ,

whence $M_{ns}$ is the truncated (short) mean and $M_{nt}$ is the mean of the tail.

There are two forms for the truncated mean. The first is the series expression discussed in connection with the canonical form; the second form uses quadrature without the series term. It is this second form which will be discussed now.

First, the density function and the moments of the Weibull are cast into the following form by the change of variable $u = x^b$, so that $T = X^b$ in accordance with the notation in the nomenclature section:

(3.4)  $f(x) = a b x^{b-1} e^{-a x^b}\, dx$  becomes  $f(u) = a e^{-au}\, du$

(3.5)  $\int a b x^{b-1} e^{-a x^b} x^n \, dx$  becomes  $\int a e^{-au} u^{n/b} \, du$


Using a second change of variable, t=a(u-T), thence u = t/a + T, and using r=1/b for convenience, the moments become


/  ^  £  V  ii  nr  • 

(3.6)  ae  u  du 


C,  —  =  M_*MODi 


where  , 

MODI  =  -4^  = 

MOD  ( nr ) I 

and MOD is as in (2.22) with nr replacing m,
so that


m 


i=l 


If dealing with reliability, $M_{ns}$ are the truncated data moments with observations terminating at time "T", with $M_{0s}$ being the zeroth moment, the CDF.

The  same  result  will  be  obtained  by  the 
alternate  quadrature  form. 


i=l 

P 


So, for truncated data, two quadrature forms and one series form are available.

4.0  INITIAL  CONDITION  PROBLEM 

4.1  Inconsistent  Initial  Conditions

The quadrature systems developed herein have the unknown variables embedded as arguments of exponentials. The argument can contain two unknowns, one a power of the other. During a search, one variable going negative and raised to a non-integer power, which is the second variable, will terminate the search since the result would be an imaginary number. The immediate response to this problem might be to prevent the offending variable from going negative during the search. Unfortunately, this also destroys the integrity of the rate-of-change vector matrix. Also, convergence to the correct result depends upon the exponential argument remaining negative during the search. The value of the exponential expression blows up when the argument becomes positive, and convergence is never achieved.

If  either  parameter  is  chosen  inconsistent 
with  respect  to  the  other,  the  algorithm  wanders 
indefinitely  and  never  converges,  or  terminates 
when  one  variable  becomes  negative.  With  a 
choice  of  consistent  parameters,  but  initially 
too  far  removed  from  the  correct  solution,  again 
the  algorithm  will  not  converge. 

Therefore, to be useful, a method must be introduced to choose the initial conditions properly.

A consistent initial condition is one which relates the unknown variables by a basic relationship applicable to the particular distribution being modelled. For example, the CDF relationship for the Weibull, $M_0 = 1 - e^{-aX^b}$, relates the parameters a and b, given knowledge of $M_0$, which is the fraction of units failed in time X. Therefore, an unrealistic initial value of b could be selected together with a consistent value of a computed by the CDF relationship.

Hence, two chores must be accomplished simultaneously: choosing realistic and choosing consistent initial conditions.


4.2  Dominant  Ratios  Method,  Weibull,  for
Approximating  Initial  Conditions

The Weibull truncated series is

(4.1)  $M_n(0,T) = \sum_{Q=1}^{k} \frac{T^{nr+Q}\, a^{Q}\, e^{-aT}}{(nr+1)(nr+2)\cdots(nr+Q)}$

The first term of the series is dominant. For any moment, the first term has the same relative degree of inaccuracy; therefore, the ratio of the first terms is much more accurate than the values of the first terms themselves; i.e., $M_1/M_0$ and $M_2/M_1$.

The first terms are, for $M_0$, $M_1$, and $M_2$,

(4.2)  $M_0(0,T) = e^{-aT} (aT)$

(4.3)  $M_1(0,T) = \frac{e^{-aT} (a T^{r+1})}{r+1}$

(4.4)  $M_2(0,T) = \frac{e^{-aT} (a T^{2r+1})}{2r+1}$

so that

$\frac{M_1}{M_0} = \frac{T^r}{r+1}$ .

With $X = T^r$ and r=1/b, where X is real-time, we find

$\frac{M_1}{M_0} = \frac{X}{r+1}$ .

Solving for r,

(4.7)  $r = X \frac{M_0}{M_1} - 1$ .

Thus, the parameter r for the Weibull is found: it is time-dependent as well as a function of $M_0$ and $M_1$. This provides a very good initial estimate of the Weibull parameter b since b=1/r.

With knowledge of a realistic initial value of b, a consistent value of a can be found by the CDF relationship, $M_0 = 1 - e^{-aX^b}$. Solving for a,

(4.8)  $a = -\ln(1 - M_0) / X^b$
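The two-step choice of initial conditions described above (a realistic b from the moment ratio, then a consistent a from the CDF relationship) can be sketched as follows. This is a minimal illustration under the reading that r = X·M0/M1 - 1 and a = -ln(1 - M0)/X^b; the function name `initial_weibull_parameters` and the sample numbers are hypothetical.

```python
import math


def initial_weibull_parameters(m0, m1, x):
    """Return (r, b, a) from truncated data moments m0, m1 and real time x.

    r is taken from the dominant-ratio relation r = x * m0 / m1 - 1, with b = 1 / r.
    a is taken from the CDF relation a = -ln(1 - m0) / x**b, reading m0 as the
    fraction of units failed by time x.
    """
    if not (0.0 < m0 < 1.0) or m1 <= 0.0 or x <= 0.0:
        raise ValueError("need 0 < m0 < 1, m1 > 0 and x > 0")
    r = x * m0 / m1 - 1.0
    if r <= 0.0:
        raise ValueError("moment ratio gives a non-positive r")
    b = 1.0 / r
    a = -math.log(1.0 - m0) / x ** b
    return r, b, a


if __name__ == "__main__":
    # Hypothetical data: 30% of units failed by time 5, first truncated moment 0.9.
    print(initial_weibull_parameters(m0=0.30, m1=0.9, x=5.0))
```

By construction the returned a and b reproduce the observed fraction failed at time x, which is the sense in which the pair is consistent.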

It can be shown that the series of (4.1), when taken to an infinite number of terms, is precisely the CDF; that is, for n=0, (4.1) is

(4.9)  $\mathrm{CDF} = M_0(0,T) = \sum_{Q=1}^{\infty} \frac{(aT)^Q e^{-aT}}{Q!}$

THE PROBABILITY INTEGRALS OF THE MULTIVARIATE NORMAL:
THE 2^n TREE AND THE ASSOCIATION MODELS
Dror  Rom,  Biometrics  Research,  Merck  Sharp  &  Dohme 
Sanat  K.  Sarkar,  Temple  University 


1.  Introduction 


The standard multivariate normal density has the following form:

$f(z) = (2\pi)^{-n/2} |\Sigma|^{-1/2} \exp\left(-\tfrac{1}{2} z' \Sigma^{-1} z\right)$

where

$\Sigma = \begin{pmatrix} 1 & \rho_{12} & \cdots & \rho_{1n} \\ \rho_{21} & 1 & \cdots & \rho_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ \rho_{n1} & \rho_{n2} & \cdots & 1 \end{pmatrix}$

The evaluation of the probability integrals of the multivariate normal distribution is of great importance to statisticians. The joint distribution of several random variables is often assumed to be multivariate normal. This distribution also provides an approximation to many multivariate distributions, including the multinomial distribution, when the sample size is large. Most of the work so far has concentrated on one of three cases: the evaluation of the bivariate and trivariate normal probability integrals; the evaluation of the multivariate normal probability integrals for special cases of the correlation matrix; the evaluation of the multivariate normal probability integrals for special domains. Not much work has been done to achieve a reasonable technique for the evaluation of the probability integrals of a general multivariate normal distribution on any (rectangular) region.


The  techniques  for  the  evaluation  of  the  multivariate 
normal  probabilities  can  be  categorized  as: 

(1)  Expansions  of  the  density  in  power  series. 

(2)  Reduction to lower dimensions and then using quadratures.

(3)  Modeling  the  probability  surface  (log-linear  models  for 
example). 

(4)  Monte-Carlo  integration  techniques. 

2.  The  Contingency  Tables  And  Association  Models.


Goodman (1981) developed the following association model: For an I x J contingency table, let $F_{ij}$ denote the expected frequency in the ith row and jth column of the table (i = 1,...,I; j = 1,...,J). Consider the following model for the expected frequencies

$F_{ij} = \alpha_i \beta_j e^{\phi \mu_i \nu_j}$  (1)

where $\alpha_i$, $\beta_j$, $\mu_i$, $\nu_j$ and $\phi$ are parameters. Let $\theta_{ij}$ denote the local cross product ratios given by

$\theta_{ij} = (F_{ij} F_{i+1,j+1}) / (F_{i,j+1} F_{i+1,j})$  (2)

(i = 1,...,I-1; j = 1,...,J-1).


From (1) and (2) we obtain

$\gamma_{ij} = \log \theta_{ij} = \phi (\mu_i - \mu_{i+1})(\nu_j - \nu_{j+1})$.  (3)

If we let $\mu_i - \mu_{i+1} = \Delta$ (i = 1,...,I-1), $\nu_j - \nu_{j+1} = \Delta'$ (j = 1,...,J-1) (where $\Delta$ and $\Delta'$ are unspecified) we obtain the uniform association model. Holland and Wang (1987) investigated the extension of $\gamma_{ij}$ to continuous bivariate densities f(x,y). They showed that the limiting case of $\gamma_{ij}$ is the bivariate function

$\gamma_f(x,y)\,\Delta x\,\Delta y = \frac{\partial^2}{\partial x\,\partial y} \log f(x,y)\,\Delta x\,\Delta y$.  (4)

They called $\gamma_f(x,y)$ the local dependence function for f(x,y). For the bivariate normal density, the local dependence function is $\gamma_f(x,y) = \rho/(1-\rho^2)$. The logarithm of the local cross product ratio for the four infinitesimal rectangular regions around (x,y), (x,y+b), (x+a,y) and (x+a,y+b) is

$\{\rho/(1-\rho^2)\}\,ab$,  (5)

which is independent of (x,y). Let

$\phi = \tilde{\rho}/(1 - \tilde{\rho}^2)$

where

$\tilde{\rho} = \rho \{(1 - \delta_r^2/12)(1 - \delta_c^2/12)\}^{1/2}$  (6)

and $\delta_r$ and $\delta_c$ are the widths of the corresponding row and column categories. Note that (6) is Sheppard's correction for grouped data (Kendall and Stuart (1976)). Model (1) becomes

$F_{ij} = \alpha_i \beta_j \exp\{\tilde{\rho}\, \mu_i \nu_j / (1 - \tilde{\rho}^2)\}$.  (7)

In this form, $\alpha_i$ and $\beta_j$ can be looked at as main effects in the model, and $\mu_i$ and $\nu_j$ are the centers of the ith row and jth column respectively. Wang (1987) showed that (7) has approximately the same local cross product ratios as does the bivariate normal density. By Theorem 2.1.1 of Wang (1987), if model (7) is used to approximate the bivariate normal probabilities, then, if the resulting contingency table has marginals which are univariate normal probabilities, the cell frequencies will fit the bivariate normal probabilities well. However, since in general $\alpha_i$ and $\beta_j$ are not known, model (7) cannot be used directly. Hence both Goodman (1981) and Wang (1987) use the proportional fitting algorithm to obtain the cell frequencies. Noticing that if we drop $\alpha_i$ and $\beta_j$ from model (7), the cell frequencies will still have the same local cross product ratios as does the corresponding bivariate normal density, we can use

$F_{ij} = e^{\phi \mu_i \nu_j}$

as starting values for the proportional fitting algorithm. This procedure cycles alternately between row scalings and column scalings until both row totals and column totals have been matched. Bishop, Fienberg & Holland (1975) showed that for a complete two-way table this algorithm always converges.
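The alternating row and column scalings described above can be sketched as follows. This is a generic illustration of proportional fitting on a small positive table, not the authors' code; the function name `proportional_fit`, the tolerance, and the sample table are assumptions made for the example.

```python
def proportional_fit(start, row_totals, col_totals, tol=1e-12, max_iter=1000):
    """Scale rows and columns of a positive table until both margins match.

    Returns the fitted table and the number of row/column cycles used.
    """
    table = [row[:] for row in start]  # work on a copy
    n_rows, n_cols = len(table), len(table[0])
    for iteration in range(1, max_iter + 1):
        # Row scaling: make every row sum equal its target.
        for i in range(n_rows):
            factor = row_totals[i] / sum(table[i])
            for j in range(n_cols):
                table[i][j] *= factor
        # Column scaling: make every column sum equal its target.
        for j in range(n_cols):
            factor = col_totals[j] / sum(table[i][j] for i in range(n_rows))
            for i in range(n_rows):
                table[i][j] *= factor
        # Columns now match exactly; stop once the rows match as well.
        worst = max(abs(sum(table[i]) - row_totals[i]) for i in range(n_rows))
        if worst < tol:
            return table, iteration
    return table, max_iter


if __name__ == "__main__":
    fitted, cycles = proportional_fit([[1.0, 2.0], [3.0, 4.0]], [0.4, 0.6], [0.5, 0.5])
    print(fitted, cycles)
```

Because each step multiplies a whole row or a whole column by a constant, the cross product ratios of the starting table are left unchanged, which is why starting values with the right local cross product ratios are useful.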

Drawbacks  of  The  Proportional  Fitting  Algorithm. 

The contingency table approach, although it provides an interesting application of the proportional fitting algorithm in improving the existing methods for computing bivariate normal probabilities, has some major drawbacks. Among them are the following:

(1) A preliminary set of the univariate normal probabilities is needed.

(2) An undetermined number of iterations until convergence. This is shown in Wang (1987). The number of iterations until convergence is 3 when ρ is 0.05, while the number of iterations until convergence is 30 when ρ is 0.95.

(3) The procedure always generates a full contingency table since it has to match the marginal probabilities with the univariate normal. This requires a substantial number of computations even when only the probability over a small rectangle is needed.

(4) The procedure requires a lot of memory space to store all cell probabilities, especially as the dimensionality increases.

(5) It is not easily extended to higher dimensions.

In the next section we will show that these drawbacks can be overcome by utilizing a general linear model together with the 2^n Tree technique.


3.  Approximating The Main Effects In The Log-Linear Model.

Following Goodman's (1981) log linear model for the bivariate normal density as presented in (1), we consider now a different way to approximate the bivariate normal probabilities. Since the standard bivariate normal density is symmetric in its arguments, it would be natural to assume that the main effects in (1) are equal, i.e., $\alpha_i = \beta_i$, i = 1,...,I, and $\mu_i = \nu_i$, i = 1,...,I.

Suppose for the moment that the diagonal elements of the contingency table $F_{ii}$ are known. Then from (7) we obtain:

$F_{ii} = \alpha_i \beta_i e^{\phi \mu_i \nu_i} = \alpha_i^2 e^{\phi \mu_i^2}$  (8)

which holds for all i = 1,...,I. From (8) we obtain:

$\alpha_i = e^{\frac{1}{2}(\log F_{ii} - \phi \mu_i^2)}$  (9)

(9) provides a way to estimate the main effects as functions of the corresponding diagonal elements. This in turn gives a method for computing all $F_{ij}$ elements provided the diagonal elements are known. This is done by estimating the main effects and then re-substituting them in (1).

4.  Computing The Diagonal Elements of a Contingency Table.

Consider a two dimensional integrable function $f(x_1,x_2)$ on a closed interval $[a_1,b_1] \times [a_2,b_2]$. Suppose we want to evaluate the integral

$\int_{a_1}^{b_1} \int_{a_2}^{b_2} f(x_1,x_2)\,dx_1\,dx_2$.  (10)

We propose the following procedure to approximate (10).

Step 1

Let

$I_0 = \frac{(b_1 - a_1)(b_2 - a_2)}{12} \{ f(a_1,a_2) + f(a_1,b_2) + f(b_1,a_2) + f(b_1,b_2) + 8 f(a_0,b_0) \}$  (11)

where $(a_0,b_0)$ is the midpoint of the rectangle, $a_0 = (a_1+b_1)/2$ and $b_0 = (a_2+b_2)/2$.

Step 2

Partition $[a_1,b_1] \times [a_2,b_2]$ (the node) into four equally spaced rectangles (children) using $(a_0,b_0)$ as a pivotal point. Use (11) to approximate the integral over each of the resulting rectangles. Let $I_1$ denote the sum of integrals over all four children. Let $\delta$ be a convergence criterion; then if $\epsilon = |I_1 - I_0| \le \delta$, deliver $I_1$ as the approximated integral for (10), otherwise go to Step 3.

Step 3

Apply Step 1 and Step 2 to each child sequentially, where the convergence criterion over each child is $\delta'$.


The quadrature in (11) is designed to fit perfectly a second degree polynomial, i.e., if $f(x_1,x_2)$ is a second degree polynomial, then (11) will be exact.

The above is a recursive algorithm. Its main property is that it concentrates on regions where the function is not well behaved and spends less time elsewhere.
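The recursive procedure of Steps 1 to 3 can be sketched as below. The five-point rule uses the four corners and the midpoint with weights 1, 1, 1, 1, 8 over 12, which is what exactness for second degree polynomials requires. The per-child criterion δ/4 and the `max_depth` safeguard are assumptions made for this sketch, as are the function names.

```python
import math


def five_point_rule(f, a1, b1, a2, b2):
    """Corner-plus-midpoint rule on [a1, b1] x [a2, b2]; exact for degree <= 2."""
    a0, b0 = 0.5 * (a1 + b1), 0.5 * (a2 + b2)
    corners = f(a1, a2) + f(a1, b2) + f(b1, a2) + f(b1, b2)
    return (b1 - a1) * (b2 - a2) / 12.0 * (corners + 8.0 * f(a0, b0))


def tree_integrate(f, a1, b1, a2, b2, delta, max_depth=20):
    """Adaptive quadrature: split a node into four children until they agree with it."""
    i0 = five_point_rule(f, a1, b1, a2, b2)            # Step 1: value on the node
    a0, b0 = 0.5 * (a1 + b1), 0.5 * (a2 + b2)          # pivotal point
    children = [(a1, a0, a2, b0), (a1, a0, b0, b2),
                (a0, b1, a2, b0), (a0, b1, b0, b2)]
    i1 = sum(five_point_rule(f, *c) for c in children)  # Step 2: sum over children
    if abs(i1 - i0) <= delta or max_depth == 0:
        return i1
    # Step 3: recurse on each child with a tightened criterion (assumed delta / 4).
    return sum(tree_integrate(f, *c, delta / 4.0, max_depth - 1) for c in children)


if __name__ == "__main__":
    def density(x, y):
        # Standard bivariate normal density with zero correlation.
        return math.exp(-0.5 * (x * x + y * y)) / (2.0 * math.pi)

    approx = tree_integrate(density, -1.0, 1.0, -1.0, 1.0, 1e-6)
    print(approx, math.erf(1.0 / math.sqrt(2.0)) ** 2)
```

Only the children that fail the test are subdivided further, so the work concentrates where the integrand is least like a quadratic.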

5.  The Error In The 2^n Tree.

To compute the error, we can, without loss of generality, assume that

$[a_1,b_1] \times [a_2,b_2] = [-a,a] \times [-b,b]$

and

$(a_0,b_0) = (0,0)$.

Now, we want the error

$\int_{-b}^{b} \int_{-a}^{a} f(x_1,x_2)\,dx_1\,dx_2 - \frac{ab}{3} \{ f(-a,-b) + f(-a,b) + f(a,-b) + f(a,b) + 8 f(0,0) \}$.  (12)

We expand $f(x_1,x_2)$ in a Taylor series around (0,0) and substitute throughout (12). All first and second order terms vanish, as they should, for they constitute a second degree polynomial. The third order term in the expansion vanishes by virtue of odd symmetry. Hence we have (after integrating the fourth order term):

Error $= -\frac{b a^5}{45} \frac{\partial^4 f(x_1,x_2)}{\partial x_1^4} - \frac{a b^5}{45} \frac{\partial^4 f(x_1,x_2)}{\partial x_2^4} - \frac{2 a^3 b^3}{9} \frac{\partial^4 f(x_1,x_2)}{\partial x_1^2\, \partial x_2^2}$  (13)

Equation (13) gives the error when applying the quadrature (11) to the region $[-a,a] \times [-b,b]$.

6.  The Log-Linear Model And The 2^n Tree.

We now combine the 2^n Tree and the log-linear model as follows. Each of the diagonal elements (nodes) is partitioned into four equally spaced rectangles (children). The diagonal elements are computed by the above quadrature whereas the off-diagonal elements are computed using the log-linear model. We then combine the area over the children and compare with the node. If convergence has been reached, we stop partitioning this element; otherwise we partition each of the diagonal elements and apply the same procedure. This technique, besides being fast, requires little memory space since only one diagonal element is examined at a time. This property is extremely useful in higher dimensions.

7.  The Trivariate Standard Normal Distribution

Extending (1) to the trivariate normal distribution, we put the following model for the expected frequencies

logF.y*  =  (1  -  P23^)6^  +  (1  -  Pl3^)bf  +  (1  -  Pl2^)b^

-t-  ^  ^(pi3P23  -  +  (P12P23  -  Pl3)p-i  Pk

+  (Pl2Pl3  -  P23)pf  Pk  j

(14)

where

$\Delta = \det \begin{pmatrix} 1 & \rho_{12} & \rho_{13} \\ \rho_{21} & 1 & \rho_{23} \\ \rho_{31} & \rho_{32} & 1 \end{pmatrix}$.

We assume that

$\delta_i^1 = \delta_i^2 = \delta_i^3 = \delta_i$ and $\mu_i^1 = \mu_i^2 = \mu_i^3 = \mu_i$ for all i.

Putting model (14) for the diagonal elements and solving for $\delta_i$ we get

(log  I^iii  -  M?^) _

^  “  (^12  +  Pu  ^23)

(15)

where

^  ^  (Pl2Pl3  +  P12P23  +  P13P23  —  (Pl2  +  Pl3  +  P23))

As for the bivariate normal distribution, we can solve for the main effects provided the diagonal elements are known. We again use the 2^n Tree combined with the linear model to generate all diagonal elements.

8.  Comparisons


We compare here the results obtained from the log-linear models with some known results. For the bivariate normal probabilities, we compare with tabulated probabilities given by Goodman (1981). For the trivariate normal probabilities, we make this comparison with the tabulated probabilities given by Gupta (1963) and the exact results for some special cases of the domain of integration. These exact results were given by David (1953).


Table 1

Probabilities under the bivariate normal density with ρ = 0.5. First entry is the tabulated probability. Second entry is the computed probability. The computations were carried out in single precision with maximum error less than 0.0001.

Table 2

Selected probabilities under the trivariate normal density. Results from the proposed technique are compared with tabulated probabilities (when available) given by Gupta (1963), and exact probabilities (when available). The computations were carried out in single precision with maximum error less than 0.001.

MULTIPLE  SMOOTHING  PARAMETERS
IN  SEMIPARAMETRIC  MULTIVARIATE  MODEL  BUILDING

1.  INTRODUCTION

Semiparametric model building, particularly using multivariate splines of various types, has the potential to allow the organization and analysis of large data sets which represent responses as a function of several variables. One of the major stumbling blocks to the further development of these techniques has been the heavy and sometimes prohibitive computational cost of estimating multiple smoothing parameters. Some recent work in Madison (Gu et al. (June 1988), Gu (June 1988)) has resulted in improved numerical methods for speeding up the calculation of both GCV (generalized cross validation) and GML (generalized maximum likelihood) estimates of multiple smoothing parameters. (See Wahba (1985) and references there for a discussion of these estimates.)

This work has allowed us to explore the use of interaction smoothing splines (Barry (1986), Wahba (1986), Gu et al. (June 1988)) for multivariate exploratory model building, and also to tackle some interesting problems concerning the merging of data from different sources with different and only partially known error structures.

In the first part of this paper we will briefly discuss the data-merging problem and in the second part we will describe some recent model building work with interaction smoothing splines with multiple smoothing parameters. We note that J. Friedman's extremely interesting keynote talk at this conference concerned what might be called interaction regression splines. There are philosophically interesting similarities as well as contrasts in his approach and the one we describe.

1.  A  DATA  MERGING  PROBLEM  ARISING  IN 
METEOROLOGY 

To  motivate  the  problem,  we  first  describe  a  very  spe¬ 
cial  concrete  case,  then  we  consider  a  more  general  ver¬ 
sion.  Further  details  appear  in  Wahba  (1988). 

Let P be (latitude, longitude) and let f(P) be the 500 millibar height, that is, the height in the atmosphere at which the pressure is 500 millibars. Every 12 hours the global radiosonde (weather balloon) network observes the 500 mb height and reports the observations

$y_i^{(o)} = f(P_i) + \epsilon_i^{(o)}, \quad i = 1, \ldots, n$

where the $\epsilon_i^{(o)}$ are treated as independent random errors with a common variance $\sigma_o^2$. Here we will treat all random variables as Gaussian with 0 mean unless otherwise specified. Simultaneously, there is a forecast of the state variables of the model, which can be converted to a forecast of f(P). Let this forecast be

$y_i^{(f)} = f(P_i) + \epsilon_i^{(f)}$

where $\epsilon_i^{(f)}$ is the forecast error. The problem is to merge the observational and forecast data to get a new estimate for the 500 mb height, which is then used as part of the initial conditions for a numerical weather forecast model. Generally, the error variance of the observations depends on the equipment and is known. The forecast error is generally correlated, and depends on the particular forecast model in question. It has not been as well known as one would wish. Recent work at the European Center for Medium Range Forecasts (ECMWF) (Hollingsworth and Lonnberg (1986), Lonnberg and Hollingsworth (1986)) has provided some fairly detailed information on the 500 mb forecast error covariance structure of the model there. The forecast error spatial covariance was estimated, based on analysis of three months of data comparing observation to forecast. These results can be used to "retune" the estimation of initial conditions (which is then of course going to change the forecast error covariance). The forecast error can depend on many things, including the weather itself, and we were interested in seeing whether useful information concerning forecast error covariance could be obtained dynamically, that is, from one instantaneous set of 500 millibar height observations, which consists of the order of 600-1000 observations, and one global forecast. If it can, then the information can be fed back into the model, to improve the estimation of the initial conditions. In Wahba (1988) the forecast error covariance is modelled by

$E\, \epsilon_i^{(f)} \epsilon_j^{(f)} = \sigma_f^2\, Q_\theta(P_i,P_j)$,

where $Q_\theta(\cdot,\cdot)$ is an isotropic correlation function on the sphere depending on the single parameter θ and defined by

$Q_\theta(P_i,P_j) = \rho_\theta(\gamma(P_i,P_j))$,

$\rho_\theta(\gamma) = \frac{(1 - 2\theta\cos\gamma + \theta^2)^{-1/2} - (1+\theta)^{-1}}{(1-\theta)^{-1} - (1+\theta)^{-1}}$,

where γ is the angular distance between $P_i$ and $P_j$.

Figure 1 is a plot of $\rho_\theta(\gamma)$ for seven values of θ. θ is a monotone function of the 1/2 power point $\gamma_{1/2}$ (the distance for which the correlation is down to 1/2), and $\gamma_{1/2}$ is probably the single most important parameter (in a practical sense) of an isotropic correlation function on the sphere. This family of correlation functions was chosen for mathematical convenience and for the resemblance of some of its members to correlation functions in Hollingsworth and Lonnberg (1986), Lonnberg and Hollingsworth (1986).
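To make the link between θ and the 1/2 power point concrete, the sketch below evaluates a correlation function of the form read from the formula above, normalised to 1 at γ = 0 and to 0 at γ = π, and finds the half power point by bisection. The exact functional form is an assumption recovered from a damaged formula, and the function names are illustrative.

```python
import math


def rho(theta, gamma):
    """Isotropic correlation on the sphere, equal to 1 at gamma = 0 and 0 at gamma = pi."""
    if not 0.0 < theta < 1.0:
        raise ValueError("theta must lie strictly between 0 and 1")
    kernel = (1.0 - 2.0 * theta * math.cos(gamma) + theta * theta) ** -0.5
    low = 1.0 / (1.0 + theta)    # kernel value at gamma = pi
    high = 1.0 / (1.0 - theta)   # kernel value at gamma = 0
    return (kernel - low) / (high - low)


def half_power_point(theta, tol=1e-12):
    """Angular distance (radians) at which rho drops to 1/2, found by bisection."""
    lo, hi = 0.0, math.pi
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if rho(theta, mid) > 0.5:
            lo = mid   # correlation still above 1/2: move right
        else:
            hi = mid
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    for theta in (0.3, 0.5, 0.7, 0.9):
        print(theta, half_power_point(theta))
```

Bisection is sufficient here because, for fixed θ, the function decreases steadily as γ goes from 0 to π.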


Fig. 1. $\rho_\theta(\gamma)$ for seven different values of θ (horizontal axis: γ, radians).


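Making the reasonable assumption that the $\epsilon_i^{(o)}$ and $\epsilon_i^{(f)}$ are independent, one can estimate $\sigma_f^2$, $r = \sigma_o^2/\sigma_f^2$ and θ by maximum likelihood, by considering

$z_i = y_i^{(o)} - y_i^{(f)}$;

then $z = (z_1, \ldots, z_n)'$ has the distribution

$z \sim N(0, \sigma_f^2 (rI + Q_\theta))$

where $Q_\theta$ is the n×n matrix with ij-th entry $Q_\theta(P_i,P_j)$.

The maximum likelihood estimates of r and θ can be shown to be the minimizers of

$M(r,\theta) = \frac{z'(rI + Q_\theta)^{-1} z}{[\det (rI + Q_\theta)^{-1}]^{1/n}}$

and the ML estimate of $\sigma_f^2$ is

$\hat{\sigma}_f^2 = \frac{1}{n} z'(\hat{r}I + Q_{\hat{\theta}})^{-1} z$

where $\hat{r}$ and $\hat{\theta}$ are the ML estimates of r and θ.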

An efficient algorithm suitable for minimizing M(r,θ) (as well as the GCV function, to be discussed later) with large data sets has been proposed by Gu et al. (June 1988), and some code is available in Gu (June 1988). An outline of the algorithm goes as follows:

i) For fixed θ, tridiagonalize $Q_\theta$ as

$U' Q_\theta U = T_\theta$

where U is orthogonal and T is tridiagonal, by successively applying the Householder transformation. A strategy for speeding up this step by appropriate truncation which sets suitably small elements of the diagonal of $T_\theta$ to 0 appears in Gu et al. (June 1988) and is in the code in Gu (June 1988). Then

$M(r,\theta) = \frac{h'(rI + T_\theta)^{-1} h}{[\det (rI + T_\theta)^{-1}]^{1/n}}$

where $h = U'z$.

ii) For each trial value of r, do a Cholesky decomposition $C'C$ of $(rI + T_\theta)$, where C is upper bidiagonal, with diagonal entries $a_1, \ldots, a_n$ and superdiagonal entries $b_1, \ldots, b_{n-1}$.

iii) The numerator of M is then computed by back substitution and the denominator as $(\prod_{i=1}^{n} a_i)^{-2/n}$.

iv) For fixed θ conduct a search in log r, then step to a new θ.

Most of the work is in the tridiagonalization; thus, a search is "cheap" in r and expensive in θ.
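Steps ii) to iv) for a fixed θ can be sketched as below, assuming the tridiagonal matrix (its diagonal `diag` and off-diagonal `off`) and the rotated data vector `h` have already been computed in step i). The function names and the grid search are illustrative assumptions and are not taken from the code of Gu (June 1988).

```python
import math


def gml_criterion(r, diag, off, h):
    """M(r) = h'(rI + T)^{-1} h / [det (rI + T)^{-1}]^{1/n} for tridiagonal T.

    Assumes rI + T is positive definite; otherwise math.sqrt raises ValueError.
    """
    n = len(diag)
    a = [0.0] * n          # diagonal of the upper bidiagonal Cholesky factor C
    b = [0.0] * (n - 1)    # superdiagonal of C
    a[0] = math.sqrt(r + diag[0])
    for i in range(1, n):
        b[i - 1] = off[i - 1] / a[i - 1]
        a[i] = math.sqrt(r + diag[i] - b[i - 1] ** 2)
    # Solve C' y = h by substitution; then h'(rI + T)^{-1} h = y'y.
    y = [0.0] * n
    y[0] = h[0] / a[0]
    for i in range(1, n):
        y[i] = (h[i] - b[i - 1] * y[i - 1]) / a[i]
    numerator = sum(v * v for v in y)
    # Dividing by (prod a_i)^(-2/n) is multiplying by (prod a_i)^(2/n).
    log_det = 2.0 * sum(math.log(v) for v in a)
    return numerator * math.exp(log_det / n)


def search_log_r(diag, off, h, log_r_grid):
    """Return the grid value of log r that minimises the criterion."""
    return min(log_r_grid, key=lambda lr: gml_criterion(math.exp(lr), diag, off, h))


if __name__ == "__main__":
    grid = [0.1 * k for k in range(-30, 31)]
    print(search_log_r([2.0, 3.0, 1.5], [1.0, 0.5], [1.0, 2.0, -1.0], grid))
```

Each evaluation costs a number of operations proportional to n, which is why the search in r is cheap once the tridiagonalization for a given θ is done.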


This algorithm has allowed us to ask practical questions based on realistic simulated data, with n=600 or more, using our Sun workstation. One such question is whether r can be estimated sufficiently accurately, given one set of data, to make the method useful in a practical sense for adjusting the relative weights to be given to observation and forecast when merging them to get new initial conditions. For example, Figure 2 gives a histogram of $\log \hat{r}$ from a simulation experiment with 1000 replications. (Note that the matrix decompositions above are only done once!) For this simulation it was assumed that θ was known, and it was taken to correspond to a realistic value of the half power point of 500 km. Data was simulated for 611 Northern Hemisphere radiosonde stations with a true r of 2/3. Percentiles of the distribution are given on the plot, and one can see that, under the ideal conditions of the simulation, one could reliably detect a drop in r from the nominal 2/3 to .43 or less.

Fig. 2. Histogram for $\log \hat{r}$ (MLE; true r = 2/3, L = 500).

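We remark that this problem of estimating relative accuracy generalizes to indirectly sensed data from different types of instruments. Let f be some meteorological field of interest, and suppose we have two different sources of data,

$$y^{(1)} = K^{(1)} f + \epsilon^{(1)}$$

and

$$y^{(2)} = K^{(2)} f + \epsilon^{(2)},$$

where

$$y^{(\alpha)} = (y_1^{(\alpha)}, \ldots, y_{n(\alpha)}^{(\alpha)})', \quad K^{(\alpha)} f = (K_1^{(\alpha)} f, \ldots, K_{n(\alpha)}^{(\alpha)} f)', \quad \epsilon^{(\alpha)} = (\epsilon_1^{(\alpha)}, \ldots, \epsilon_{n(\alpha)}^{(\alpha)})', \quad \alpha = 1, 2.$$

If f were the three dimensional atmospheric temperature distribution, one source could possibly be the important case of satellite observed radiances. Suppose that $\epsilon^{(\alpha)} \sim N(0, \sigma_{(\alpha)}^2 \Sigma^{(\alpha)})$ and it is desired to estimate $r = \sigma_{(1)}^2 / \sigma_{(2)}^2$ and, possibly, some parameters in $\Sigma^{(\alpha)}$, α = 1, 2. To proceed as before, we need the existence of two matrices $B^{(\alpha)}$, α = 1, 2, of dimension n × n(α), for some sufficiently large n, which satisfy

$$B^{(1)} K^{(1)} = B^{(2)} K^{(2)}.$$

Let $w^{(\alpha)}$ be defined by

$$w^{(\alpha)} = B^{(\alpha)} y^{(\alpha)}, \quad \alpha = 1, 2,$$

and let ξ be defined by

$$\xi = w^{(1)} - w^{(2)}.$$

The covariance matrix of ξ is then

$$E\xi\xi' = \sigma_{(1)}^2 B^{(1)} \Sigma^{(1)} B^{(1)\prime} + \sigma_{(2)}^2 B^{(2)} \Sigma^{(2)} B^{(2)\prime}.$$

Suppose $B^{(1)} \Sigma^{(1)} B^{(1)\prime}$ is of full rank; then we can take the Cholesky decomposition LL' of $B^{(1)} \Sigma^{(1)} B^{(1)\prime}$, where L is lower triangular, and let $z = L^{-1}\xi$. Then the covariance matrix of z is

$$Ezz' = \sigma_{(2)}^2 (rI + Q),$$

where $r = \sigma_{(1)}^2 / \sigma_{(2)}^2$ and $Q = L^{-1} B^{(2)} \Sigma^{(2)} B^{(2)\prime} L^{-1\prime}$. The ML estimate is then given by the minimizer of M, and the estimability of r depends on the properties of Q. Loosely speaking, the two vectors $B^{(1)}\epsilon^{(1)}$ and $B^{(2)}\epsilon^{(2)}$ need to have their "energy" at different "wavenumbers". Questions about the existence of good, or consistent, estimates as n → ∞ can be approached by studying the properties of Q from the point of view of the theory of equivalence and perpendicularity. See Stein (1988), Stein (1987), Wahba (1988), Wahba (March, 1987).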

2.  INTERACTION  SPLINES 

Interaction splines provide a tool for modelling a response which may depend nonparametrically on d variables, as sums of (smooth) functions of one variable, sums of functions of two variables, and so forth. Sums of functions of one variable are known as additive models; see, for example, Friedman, Grosse, and Stuetzle (1983), Stone (1985), Hastie and Tibshirani (1987). Interaction splines loosely fall into two types, namely, regression splines, whereby the estimate is a least squares regression on a set of basis functions (the number and types of which play the role of the smoothing parameter(s)), and smoothing splines, where the estimate is the solution of a penalized least squares problem in a reproducing kernel Hilbert space with an appropriate norm or seminorm. (Such estimates are always Bayes estimates, see Kimeldorf and Wahba (1971), Wahba (1978).) Hybrid splines result when one solves the penalized least squares problem in a finite dimensional (approximating) space of basis functions. In this case, the multiplier(s) on the penalty(ies) and the number of basis functions may both act as smoothing parameters. J. Friedman (these proceedings) was concerned with regression splines; in this Section we are concerned with smoothing splines. Computing the GCV estimate for the number of basis functions for regression splines is not a problem (although modifications to the GCV to account for knot selection raise interesting questions, see Friedman (August 1988), Friedman and Silverman (1987)). The computation of the GCV function in the smoothing case can be a major numerical challenge when there are large data sets and multiple smoothing parameters, and heretofore has been a deterrent to work with multiple smoothing parameters.
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The  reproducing  kernel  (rk)  hilbert  space  that  we  and 
others (Barry (1983), Barry (1986), Wahba (1986), Gu et al. (June 1988)) have proposed as the natural setting for interaction smoothing splines, is the tensor product of d one dimensional rk spaces (see Wahba (1975) for an older work on tensor product spaces). We remark that the tensor product spline spaces are qualitatively different from the spaces which provide the setting for the thin plate splines (see Wahba and Wendelberger (1980) for example). Some remarks contrasting tensor product and thin plate splines may be found in Wahba (1986).

In  the  remainder  of  this  paper,  we  describe  interaction 
splines,  and  show  how  the  algorithm  proposed  in  Gu  et  al. 
(June  1988)  can  be  used  to  choose  multiple  smoothing 
parameters  and  build  interaction  spline  models  via  the  use 
of  GCV. 

We  first  describe  an  abstract  result  concerning  the 
fitting  of  functions  with  different  smoothing  parameters 
associated  with  different  components  of  the  estimate.  The 
application  to  interaction  spline  models  will  then  be  fairly 
easy  and  will  be  described  next. 

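Let $H = H_0 \oplus H_1$ be a reproducing kernel Hilbert space of functions of $x = (x_1, \ldots, x_d)$, where $H_0$ is of finite dimension M and $H_1$ is the direct sum of p orthogonal subspaces $H^1, \ldots, H^p$,

$$H_1 = \sum_{\beta=1}^{p} \oplus H^\beta. \qquad (2.1)$$

Suppose we wish to find $f \in H$ to minimize

$$\frac{1}{n} \sum_{i=1}^{n} (y_i - f(x(i)))^2 + \lambda \sum_{\beta=1}^{p} \theta_\beta^{-1} \| P^\beta f \|^2, \qquad (2.2)$$

where $x(i) = (x_1(i), \ldots, x_d(i))$, $\theta_1 = 1$, and $P^\beta$ is the orthogonal projector in H onto $H^\beta$. (The $H^\beta$ will later be various subspaces for main effects, two factor interactions, etc.) If the rk for $H^\beta$ with squared norm $\| P^\beta f \|^2$ is $Q^\beta(\cdot\,;\cdot)$, then the rk for $H_1$ with squared norm $\sum_{\beta=1}^{p} \theta_\beta^{-1} \| P^\beta f \|^2$ is

$$\sum_{\beta=1}^{p} \theta_\beta Q^\beta(\cdot\,;\cdot) = Q_\theta(\cdot\,;\cdot), \qquad (2.3)$$

say. The following facts are well known (Kimeldorf and Wahba (1971), Wahba (1978)). Let $\phi_1, \ldots, \phi_M$ span $H_0$ and suppose the design points x(1), ..., x(n) are such that least squares regression in $H_0$ is unique. Then the minimizer $f_{\lambda,\theta}$ of (2.2) is defined by

$$f_{\lambda,\theta}(x) = \sum_{\nu=1}^{M} d_\nu \phi_\nu(x) + \sum_{i=1}^{n} c_i Q_\theta(x ; x(i)),$$

where $d = (d_1, \ldots, d_M)'$ and $c = (c_1, \ldots, c_n)'$ satisfy

$$(Q_\theta + n\lambda I) c + S d = y,$$

$$S' c = 0,$$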

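where $Q_\theta$ is the n×n matrix with ijth entry $Q_\theta(x(i) ; x(j))$ and S is the n×M matrix with iνth entry $\phi_\nu(x(i))$. Letting the Q-R decomposition of S be

$$S = (F_1 : F_2) \begin{pmatrix} R \\ 0 \end{pmatrix},$$

a series of standard calculations (see e.g. Wahba (1985)) gives that the influence matrix A(λ,θ) for this problem is

$$I - A(\lambda,\theta) = n\lambda F_2 (F_2' Q_\theta F_2 + n\lambda I)^{-1} F_2'$$

and the GCV function becomes

$$V(\lambda,\theta) = \frac{z'(n\lambda I + \Sigma_\theta)^{-2} z}{[\mathrm{tr}(n\lambda I + \Sigma_\theta)^{-1}]^2}, \qquad (2.4)$$

where

$$z = F_2' y$$

and

$$\Sigma_\theta = \Sigma^1 + \theta_2 \Sigma^2 + \cdots + \theta_p \Sigma^p$$

with

$$\Sigma^\beta = F_2' Q^\beta F_2$$

and $Q^\beta$ is the n×n matrix with ijth entry $Q^\beta(x(i) ; x(j))$.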
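The quantities entering (2.4) can be sketched in a few lines of dense linear algebra. This is only an illustration under stated assumptions: the function name `gcv_score` and its argument layout are hypothetical, and the explicit matrix inverse used here is not the tridiagonalization algorithm of Gu et al. (June 1988), which is what makes repeated evaluation affordable for large n.

```python
import numpy as np

def gcv_score(y, S, Q_list, theta, lam):
    """Evaluate V(lambda, theta) of (2.4) by brute-force dense algebra.

    y      : (n,) responses
    S      : (n, M) matrix with entries phi_nu(x(i))
    Q_list : list of (n, n) kernel matrices Q^beta
    theta  : weights theta_beta, with theta[0] = 1
    lam    : smoothing parameter lambda
    """
    n, M = S.shape
    # Full QR of S; the last n - M columns of F form F_2.
    F, _ = np.linalg.qr(S, mode="complete")
    F2 = F[:, M:]
    z = F2.T @ y
    Q_theta = sum(t * Q for t, Q in zip(theta, Q_list))
    Sigma_theta = F2.T @ Q_theta @ F2
    # (n*lambda*I + Sigma_theta)^{-1}; explicit inverse for clarity only.
    Minv = np.linalg.inv(n * lam * np.eye(n - M) + Sigma_theta)
    return float(z @ Minv @ Minv @ z) / float(np.trace(Minv)) ** 2
```

A grid or simplex search over (log λ, log θ) can call this function directly for small n; for n in the hundreds the one-time decomposition described in the text is the point of the algorithm.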


The algorithm for minimizing V(λ,θ) suggested in Gu et al. (June 1988) begins with steps i) and ii) as in Section 1. The numerator of V is obtained by backsubstitution, and to calculate the denominator we need to calculate $\mathrm{tr}(C^{-1} C^{-1\prime})$. Denote the i-th row of $C^{-1}$ by $c_i'$. We have $\mathrm{tr}(C^{-1} C^{-1\prime}) = \sum_{i=1}^{n} \| c_i \|^2$. From

$$C' = \begin{pmatrix} a_1 & & & \\ b_1 & a_2 & & \\ & \ddots & \ddots & \\ & & b_{n-1} & a_n \end{pmatrix}, \qquad C^{-1\prime} = (c_1, c_2, \ldots, c_n)$$

we have

$$a_i c_i = e_i - b_i c_{i+1}, \qquad i = n-1, \ldots, 1,$$

where the $e_i$'s are unit vectors. Because $C^{-1\prime}$ is lower triangular, $c_{i+1}$ is orthogonal to $e_i$. Thus we have the recursive formula

$$\| c_n \|^2 = a_n^{-2},$$

$$\| c_i \|^2 = (1 + b_i^2 \| c_{i+1} \|^2)\, a_i^{-2}, \qquad i = n-1, \ldots, 1,$$

which can be calculated in O(n) flops.
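The recursion above is short enough to state directly in code. The sketch below assumes C is upper bidiagonal with diagonal entries a_1, ..., a_n and superdiagonal entries b_1, ..., b_{n-1}; the function name `trace_inv_gram` is hypothetical.

```python
import numpy as np

def trace_inv_gram(a, b):
    """Return tr(C^{-1} C^{-1}') for upper bidiagonal C in O(n) operations.

    a : diagonal of C, length n
    b : superdiagonal of C, length n - 1
    """
    n = len(a)
    # ||c_n||^2 = a_n^{-2}
    norm2 = 1.0 / a[-1] ** 2
    total = norm2
    # ||c_i||^2 = (1 + b_i^2 ||c_{i+1}||^2) / a_i^2 for i = n-1, ..., 1
    for i in range(n - 2, -1, -1):
        norm2 = (1.0 + b[i] ** 2 * norm2) / a[i] ** 2
        total += norm2
    return total
```

For a small matrix the result can be compared with `np.trace(Cinv @ Cinv.T)` computed from an explicit inverse, which costs O(n^3) instead.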
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We  now  describe  how  one  obtains  interaction  spline 
models. Let $W_2^m$ be the Sobolev space

$$W_2^m = \{ f : f, f', \ldots, f^{(m-1)} \text{ abs. cont.}, \; f^{(m)} \in L_2[0,1] \}$$

with the squared norm

$$\| f \|^2 = \sum_{\nu=0}^{m-1} (R_\nu f)^2 + \int_0^1 (f^{(m)}(x))^2 \, dx,$$

where

$$R_\nu f = \int_0^1 f^{(\nu)}(x) \, dx, \qquad \nu = 0, 1, \ldots, m-1.$$

Let $k_l(x) = B_l(x)/l!$, where $B_l$ is the l-th Bernoulli polynomial (Abramowitz and Stegun (1965)); we have $R_\nu k_l = \delta_{\nu - l}$, where $\delta_i = 1$ for i = 0, and 0 otherwise. With this norm, $W_2^m$ can be decomposed as the direct sum of m orthogonal one-dimensional subspaces $\{k_l\}$, l = 0, 1, ..., m-1, where $\{k_l\}$ is the one-dimensional subspace spanned by $k_l$, and $H_*$, which is the subspace (orthogonal to $\sum_l \oplus \{k_l\}$) satisfying $R_\nu f = 0$, ν = 0, 1, ..., m-1, that is,

$$W_2^m = \{k_0\} \oplus \{k_1\} \oplus \cdots \oplus \{k_{m-1}\} \oplus H_*.$$

This construction can be found in e.g. Craven and Wahba (1979). Letting $\otimes^d W_2^m$ be the tensor product of $W_2^m$ with itself d times, we have

$$\otimes^d W_2^m = \otimes^d \, [\{k_0\} \oplus \cdots \oplus \{k_{m-1}\} \oplus H_*]$$

and $\otimes^d W_2^m$ may be decomposed into the direct sum of $(m+1)^d$ fundamental subspaces, each of the form

$$[\;\;] \otimes [\;\;] \otimes \cdots \otimes [\;\;] \qquad (d \text{ boxes})$$

where each box ( [ ] ) is filled with either $\{k_l\}$ for some l, or $H_*$.


The subspace of $\otimes^d W_2^m$ for additive splines is the direct sum of all fundamental subspaces for which at most one of the boxes is filled with a symbol other than $\{k_0\}$. In the additive model $f(x_1, \ldots, x_d)$ is of the form

$$f(x_1, \ldots, x_d) = \mu + \sum_{\alpha=1}^{d} g_\alpha(x_\alpha)$$

where $g_\alpha \in \{k_1\} \oplus \cdots \oplus \{k_{m-1}\} \oplus H_*$ and the penalty term in (2.2) can be taken as

$$\lambda \sum_{\alpha=1}^{d} \theta_\alpha^{-1} \int_0^1 \left( \frac{\partial^m g_\alpha}{\partial x_\alpha^m} \right)^2 dx_\alpha.$$

Then we have the identifications: $H_0$ is the direct sum of the fundamental spaces with all $k_0$'s in the boxes except at most a single $k_l$ with l not equal to 0, and the $H^\beta$'s are the fundamental spaces with all $k_0$'s in the boxes except exactly one with $H_*$. The subspace for (all) two factor interactions is the direct sum of all fundamental subspaces for which exactly 2 boxes are filled with a symbol other than $k_0$, etc. For m=1, there is only one kind of 2 factor interaction fundamental subspace: it has 2 $H_*$'s in the boxes and d-2 $k_0$'s. For m ≥ 2, there are subspaces with 2 $H_*$'s, with 1 $H_*$ and one $k_l$, l>0, and with 2 elements of the form $k_l$, l>0; we shall call these pure, mixed, and parametric subspaces, respectively. Of course the possibilities multiply if one wishes to consider higher order interactions.
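The bookkeeping of fundamental subspaces is easy to check by enumeration. The sketch below is an illustration only; the coding of box contents (integers 0, ..., m-1 for k_0, ..., k_{m-1} and the string "H" for H_*) and the function name `classify_subspaces` are assumptions made for this example.

```python
from collections import Counter
from itertools import product

def classify_subspaces(m, d):
    """Count fundamental subspaces of the d-fold tensor product by type.

    Returns a Counter keyed by (a, h), where a is the number of boxes
    holding a symbol other than k_0 and h is the number holding H_*.
    """
    symbols = list(range(m)) + ["H"]   # k_0, ..., k_{m-1}, H_*
    counts = Counter()
    for boxes in product(symbols, repeat=d):
        n_active = sum(1 for s in boxes if s != 0)
        n_smooth = sum(1 for s in boxes if s == "H")
        counts[(n_active, n_smooth)] += 1
    return counts
```

With this coding, two factor subspaces are those with a = 2, and the pure, mixed, and parametric kinds correspond to h = 2, 1, 0 respectively.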

The form of the induced norms on the various subspaces can most easily be seen by an example. Suppose d = 4 and consider for example the subspace

$$[\{k_l\}] \otimes [H_*] \otimes [H_*] \otimes [\{k_r\}],$$

which we will assign the index l**r. Then the squared norm of the projection of f in $\otimes^4 W_2^m$ onto this subspace is

$$\| P_{l**r} f \|^2 = \int_0^1 \!\! \int_0^1 \left[ \frac{\partial^{2m}}{\partial x_2^m \, \partial x_3^m} R_{l(x_1)} R_{r(x_4)} f(x_1, x_2, x_3, x_4) \right]^2 dx_2 \, dx_3,$$

where $R_{\nu(x_\alpha)}$ means $R_\nu$ applied to what follows as a function of $x_\alpha$. Using the fact that the reproducing kernel for $\{k_l\}$ is $k_l(x) k_l(x')$ and the rk for $H_*$ is $Q(x ; x')$ given by

$$Q(x ; x') = k_m(x) k_m(x') + (-1)^{m-1} k_{2m}([x - x'])$$

where [u] is the fractional part of u (see Craven and Wahba (1979)), it is easy to see that the r.k. for this subspace, call it

$$Q_{l**r}(x_1, x_2, x_3, x_4 ; x_1', x_2', x_3', x_4') = Q_{l**r}(x ; x'),$$

is

$$Q_{l**r}(x ; x') = k_l(x_1) k_l(x_1') \, Q(x_2 ; x_2') \, Q(x_3 ; x_3') \, k_r(x_4) k_r(x_4').$$

The rk for the direct sum of any number of fundamental subspaces is the sum of the rk's, since these fundamental subspaces are all orthogonal.
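For the case m = 1 the kernels above reduce to a few polynomial evaluations, since k_0 = 1, k_1(x) = x - 1/2 and k_2(x) = (k_1(x)^2 - 1/12)/2. The following sketch builds Gram matrices for a main effect subspace and a two factor interaction subspace as products of one-dimensional kernels. The function names are hypothetical, and the code assumes the design points have been scaled to the unit cube.

```python
import numpy as np

def k1(x):
    """Scaled Bernoulli polynomial B_1(x) / 1!."""
    return x - 0.5

def k2(x):
    """Scaled Bernoulli polynomial B_2(x) / 2!."""
    return 0.5 * (k1(x) ** 2 - 1.0 / 12.0)

def rk_smooth(x, xp):
    """rk of H_* for m = 1: k_1(x) k_1(x') + k_2([x - x'])."""
    frac = np.mod(x - xp, 1.0)          # fractional part [x - x']
    return k1(x) * k1(xp) + k2(frac)

def rk_main_effect(X, Xp, alpha):
    """Gram matrix for H_* in box alpha and k_0 (constant) elsewhere."""
    return rk_smooth(X[:, None, alpha], Xp[None, :, alpha])

def rk_interaction(X, Xp, alpha, gamma):
    """Gram matrix for the two factor subspace with H_* in boxes alpha, gamma."""
    return rk_main_effect(X, Xp, alpha) * rk_main_effect(X, Xp, gamma)
```

Summing `rk_main_effect` over all variables gives the matrix for the lumped additive subspace, and summing `rk_interaction` over all pairs gives the lumped interaction subspace, following the rule that the rk of a direct sum of orthogonal subspaces is the sum of the rk's.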

We now have a very flexible model building tool, by constructing models based on subspaces of interest. To discuss some of the possibilities in a simple way, let us first restrict ourselves to the case m=1, and consider only main effects and two factor interactions. Here $H_0$ is just the space of constants; there are d one factor subspaces, one for each variable, each of which is a fundamental space with all $k_0$'s in the boxes except one $H_*$, and d(d-1)/2 two factor spaces. Assigning each space its own $\theta_\beta$ (more precisely $\lambda \theta_\beta^{-1}$) and trying to estimate all these parameters appears tricky even for small d, and, in fact, we would like to eliminate interaction terms, and even main effect terms, if they are not supported by the data. My students C. Gu and Z. Chen have been investigating various philosophical and numerical strategies for deciding to eliminate or add subspaces. The basic tool is the use of the algorithm in Gu et al. (June 1988), where we find that it is quite feasible to optimize V(λ,θ) for θ with one, two or possibly three components, with n of the order of hundreds. Therefore, to use the algorithm, we combine subspaces. Here, if we lump all additive spaces into one subspace (by taking the direct sum), and all interaction spaces into another, then there is only 1 component in θ. The rk $Q_\theta$ of (2.3) is constructed and λ and θ estimated by minimizing V of (2.4). If the estimated θ is sufficiently small, then the interaction spaces can be deleted. Various strategies for deciding what constitutes "sufficiently small" are being investigated. Other strategies consist of deleting individual interaction terms, examining main effects terms whose interactions have been deleted, etc. Proceeding to the m=2 case, possible strategies multiply quickly; however, preliminary fitting with m=1 can act as a screening tool (C. Gu, personal communication). We remark, at this point, that if one believes that the function one is trying to estimate is a sample function from the prior associated with the penalized least squares problem then the GML estimates of λ and θ are the minimizers of


$$M(\lambda,\theta) = \frac{z'(n\lambda I + \Sigma_\theta)^{-1} z}{[\det (n\lambda I + \Sigma_\theta)^{-1}]^{1/n}}.$$
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COMPUTING  EMPIRICAL  LIKELIHOODS 
Art Owen, Stanford University


Abstract 

The empirical distribution function of a sample is often presented as a nonparametric maximum likelihood estimate of the sampling distribution. The likelihood function it maximizes can be used to define a likelihood ratio function. This empirical likelihood ratio function has some of the properties of parametric likelihood ratio functions: in particular a nonparametric version of Wilks's (1938) theorem holds.

Like the bootstrap, empirical likelihood allows the statistician to substitute computer power for distributional assumptions. The methods differ in that the bootstrap uses Monte Carlo sampling while empirical likelihood performs a number of numerical optimizations.

This paper describes how the numerical optimization required by empirical likelihood may be performed. The focus is on confidence regions for multivariate means, with extensions to statistics that are smooth functions of means. For a multivariate mean, the optimization problem is convex and so there are optimization methods guaranteed to find the unique global optimum from any starting point.

One by-product is an algorithm for determining whether a point in euclidean space is within the convex hull of a given set of points.

Key Words and Phrases. Bootstrap, confidence set, convex duality, empirical likelihood, likelihood ratio test, nonparametric likelihood.

1. Introduction. Let $X_1, \ldots, X_n$ be independent random vectors in $\mathbb{R}^p$, for $p \ge 1$, with common distribution function $F_0$. The empirical distribution

$$F_n = \frac{1}{n} \sum_{i=1}^{n} \delta_{X_i}$$

is well known to be the nonparametric maximum likelihood estimate of $F_0$ based on $X_1, \ldots, X_n$. Here $\delta_x$ denotes a point mass at x.

The likelihood function that $F_n$ maximizes is

$$L(F) = \prod_{i=1}^{n} F\{X_i\}$$

where $F\{X_i\}$ is the probability of $\{X_i\}$ under F.

-p.,t  t  !((■  H  ,ii  hkf  iilioft'l  raiu. 

f'l  III  I  ion 

/I'l  /  I  /  I  /  I  '7  I  /  .  I 

i.iii  be  to  (ofi-iiuet  non  pa  t. tile'*  t  n  lonleeitie  reaioio 

.Mel  1  I  I  ■-  lit  /  h..  ,1  'I.ili  tn.d  tuleltoll.il  (o-In  .1  el  ,,f 

•  I  e  f  I  Ml  n  I  It  -  .  -  Ii  Ih'  .  I  .1  k  Ml  e  '..del  -  Ml  Hi  '  .  .1  ltd  '  <  -  id*  I  -  . 

of  li.e  he  e; 

s  ’ll.  H.  /  I  ■  I  1 

I  i'del  nill.l  I  oielM  lot  -  .1  e  ni.ii  l.e  I,  .1  .!  .1  a  I  ..nlelei;,  , 

reL'loIM  lot  /  7  ,  ■  lie  .■■■••■'.II..  •  O  ....  I  -  ....  !  .  , 

ioM  I  e  !  .  I  1.  ,■  ,.|  /  M-  11.  :  i.., 

IM.  |M . ’.hi.  ■  I  I 

I  ie  I  eet  I  ,||  o  '  I,,  e  ■  .,■  Ill  \  I  -Mi  .  . 

.  tie  t,.,:,  .e,  /  ,,o  I  ,  s  //,• 

Ill  I  e  I  , ,  I  ■  ,  e  I  /  ■  ■'  ;  ,  /  t  ,  a  til  .1 , 1  •  .  .ne  ■  ■  .  i . 
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restrict  attention  to  distributions  with  support  in  the  sample, 
that is, to distributions $F \ll F_n$. This is convenient because the statistician might not be willing to specify a bounded support for F, and because it reduces the construction of S to a finite dimensional problem. Owen (1987) proves:

Theorem 1. Let $X, X_1, X_2, \ldots$ be i.i.d. random vectors in $\mathbb{R}^p$, with $E(X) = \mu_0$, and $\mathrm{var}(X) = \Sigma$ of rank q > 0. For positive c < 1 let $S_{c,n} = \{ \int X \, dF \mid R(F) \ge c, \; F \ll F_n \}$. Then $S_{c,n}$ is a convex set and

$$\lim_{n \to \infty} P(\mu_0 \in S_{c,n}) = P(\chi^2_{(q)} \le -2 \log c).$$

Moreover if $E(\|X\|^4) < \infty$ then

$$| P(\mu_0 \in S_{c,n}) - P(\chi^2_{(q)} \le -2 \log c) | = O(n^{-1/2}).$$

This is a nonparametric version of Wilks's theorem. The rate attained is also the same as Wilks finds. DiCiccio, Hall and Romano (1988) show that when $E(\|X\|^{15}) < \infty$ the convergence in theorem 1 is at rate $n^{-1}$.

The computational problem that arises is the computation of the empirical profile likelihood ratio function

$$r(\mu) = \sup \{ R(F) \mid \textstyle\int X \, dF = \mu, \; F \ll F_n \} \qquad (1.1)$$

for various candidates μ for E(X). Tests of E(X) = μ are rejected if r(μ) is small and confidence regions are formed from the unrejected values of μ. It is convenient to plot r(μ) when p ≤ 2.
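As a worked instance of the calibration in Theorem 1 (an illustration only, assuming the plain chi-square limit and q = 2), the chi-square distribution function has a closed form, so the cutoff c can be read off directly:

```latex
P(\chi^2_{(2)} \le x) = 1 - e^{-x/2}
\quad\Longrightarrow\quad
P(\chi^2_{(2)} \le -2\log c) = 1 - e^{\log c} = 1 - c .
```

Under this limit the region {μ : r(μ) ≥ c} for a bivariate mean has asymptotic coverage 1 - c, so a nominal 95% region corresponds to c = 0.05.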

Owen  (iPsT)  shows  that  for  pn  -  F(.V) 

-2  log  r|  Pi,  )  -  7'-'  -i-  n  "  ' '  ) 
where  /'  is  Hot  elling  s  St  at  ist  ic.  1  his  suggests  referriiig  to 
(n  -  l)(//l,n  -  <1)1  i:-!  of  the  ('his<|u.'i ru  limit.  IIk' 

n  trnn  i>i  :t  turm.  lluil  cxtf'iHi^  lliu  t  niiliiif'ticc 

ill  (iiiuft ion.''  of  posiii\T‘  ^k‘'wn<‘ss. 

1  Ih'  coniiuff ioiis  l)ftw»M>ii  uiiipirii'.il  lik''lilit>n(l  ;iinl  iithur 
utirk.  f“>p‘M  oiiiy  ilif  l>f  )t )!  vt !  „  p.  iv  oiii  !i  ii*m1  ill  Out-n  (ipsTj. 
from  wliM  h  most  of  i|ii>,  aiiicP'  iv  liik'Mi, 
The maximizing weights are given by

$$w_i = \frac{1}{n} \, \frac{1}{1 + \lambda'(X_i - \mu)} \qquad (2.3)$$

so that

$$r(\mu) = \prod_{i=1}^{n} \frac{1}{1 + \lambda'(X_i - \mu)}.$$


For p = 1 it is easy to solve (2.2) with a safeguarded zero finding algorithm such as Brent's method (Press et al. 1986). Owen (1988) uses Brent's method to maximize empirical likelihood ratios for certain M-estimates. The bisection algorithm, which safeguarded zero finders use, does not exist for p > 1, so we reformulate the problem.

It is more convenient to consider finding a zero of -g(λ). Inspection of -g shows that it is the gradient with respect to λ of

$$f(\lambda) = - \sum_{i=1}^{n} \log(1 + \lambda'(X_i - \mu)) \qquad (2.4)$$

so a zero of g is a critical point of f.

We now argue that f is convex over a convex domain. Since we only need to consider λ for which all $w_i \le 1$, we may assume that

$$1 + \lambda'(X_i - \mu) \ge 1/n > 0, \qquad 1 \le i \le n \qquad (2.5)$$

so λ may be confined to the intersection D of half spaces defined by (2.5). D is convex. The Hessian of f is

$$H(\lambda) = \sum_{i=1}^{n} \frac{(X_i - \mu)(X_i - \mu)'}{[1 + \lambda'(X_i - \mu)]^2}$$

which is positive semidefinite on D and hence f is convex. When the sample variance of the $X_i$ is of full rank, H is positive definite on D. We assume from now on that H is positive definite on D, and hence that f is strictly convex on D. It follows that the solution of (2.2) is the unique global minimum of f on D.

We now have the following dual problem: to maximize (2.1) over the unit simplex subject to p constraints is to minimize f over D without constraints. The first problem is in the n - 1 independent variables of the simplex and the second is in the p components of λ, so that the unconstrained dual problem is generally of much smaller dimension than the original constrained problem.
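The dual problem can be sketched with a damped Newton iteration on f, using the gradient and Hessian given above. This is a minimal illustration and not the method used in the paper, which applies an IMSL conjugate gradient routine; the function name `el_log_ratio`, the tolerances, and the simple step-halving rule are assumptions. The sketch only enforces positivity of 1 + λ'(X_i - μ) rather than the tighter bound (2.5).

```python
import numpy as np

def el_log_ratio(X, mu, tol=1e-10, max_iter=100):
    """Minimize f(lam) = -sum(log(1 + lam'(X_i - mu))) over lam.

    Returns (log r(mu), lam, w) with w_i = 1 / (n (1 + lam'(X_i - mu))).
    Intended for mu inside the convex hull of the rows of X.
    """
    Z = np.asarray(X, dtype=float) - np.asarray(mu, dtype=float)
    n, p = Z.shape
    lam = np.zeros(p)

    def f(l):
        arg = 1.0 + Z @ l
        # Outside the domain the objective is treated as +infinity.
        return np.inf if np.any(arg <= 0.0) else -np.sum(np.log(arg))

    for _ in range(max_iter):
        arg = 1.0 + Z @ lam
        grad = -np.sum(Z / arg[:, None], axis=0)
        if np.linalg.norm(grad) < tol:
            break
        hess = (Z / arg[:, None] ** 2).T @ Z
        step = -np.linalg.solve(hess, grad)
        t, f0 = 1.0, f(lam)
        # Halve the step until it stays in the domain and does not increase f.
        while f(lam + t * step) > f0 + 1e-12 and t > 1e-12:
            t *= 0.5
        lam = lam + t * step

    w = 1.0 / (n * (1.0 + Z @ lam))
    return f(lam), lam, w
```

At a converged solution the returned weights sum to one and reproduce μ as a weighted mean, which gives a cheap check on the optimizer; the optimization is over p numbers regardless of the sample size n.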

Notice that f(λ) is the log likelihood ratio function (2.1) with (2.3) substituted for $w_i$. Interestingly, this makes the dual problem one of minimum likelihood. For values of λ ∈ D other than the solution of (2.2), the $w_i$ in (2.3) need not sum
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local  conver^enco  results  that  j'uarantr'e  convorp,(‘nce  to  a  rel¬ 
ative  miniiimm  provided  the*  starling  l>oint  is  sufliriont ly  close 
to  the  solution.  The  problem  at  hand,  the  minimization  of 
a  convex  fmiclion  over  a  convex  domain  is  known  as  convex 
programming,  for  a  discussion  of  convex  [)rograiMniirig  se(‘ 
l^sheiiichny  and  Danilin  (197S,  (’hapter  .2). 

The convergence theorems describe the performance of the algorithms when computations are made with infinite precision, and infinite sequences of steps are carried out. In practice one has to contend with finite approximations on both issues. It has been the author's experience that the computations are most easily made for μ near $\bar{X}$ and that as μ approaches the convex hull of the data the computation becomes more difficult. Algorithms may therefore be compared on the basis of how small the log likelihood ratio must become before the algorithm encounters difficulty. A natural goal for computation is to be able to compute the log likelihood ratio down to values corresponding to confidence intervals with coverage well beyond that required in practice. For other values of μ the approximation r(μ) = 0 is adequate. In the example of Section 3, the IMSL conjugate gradient routine ZXCGR applied to f extended via log* allows computation of log likelihoods smaller than -50, which far exceeds the needs of any reasonable confidence regions for the mean.

When μ is outside of the convex hull of the data, there is no solution. In practice what happens is that the algorithm terminates at a large value of λ for which the slope of the logarithm is so small that the gradient is zero to the required precision. One can tell that this has happened because the $w_i$ will no longer sum to 1. It may be more convenient to use this fact than to check whether a given point is within the convex hull of the data, especially when the dimension of the data is higher than 2.

3.  Example.  For  an  illustraiiun  w(«  use  some  data 
from  Far.sen  and  .Marx  (Ib-'^ti.  p.  MO).  Kleven  male  ducks, 
each  a  second  generation  cross  hetvvefui  mallard  and  [linfail. 
were  examiiM'd.  I'iieir  plumage  was  rated  on  a  s(  ale  from 
0  (com)>lei«*lv  mallardlike)  to  20  ( comp!et.»‘ly  piniaillike)  and 
1  lu'ir  hehavior  was  similarly  rated  on  a  scale  from  0  ( mallard  ) 
to  1'  (pintail),  figure  1  shows  th<*sf'  data,  togeth*'!  with 
m-ste<I  empiri<al  likelDiood  <onfid<’rice  contours  for  the  nierOr, 
t  he  point  wit  h  plumage-  |  1  and  Ix'ha vii  »r  -  1  1  plot  t  (uj  with 
a  <ircle  of  tvvic<’  the  ar<’a  of  the  others,  hecaus.'  it  represents 
I  wo  (lucks.  I  he  confi<h’n<e  cont  out  a  re  prf'st'nt  ed  for  nonii  n.d 
( onfidetn  (•  levels:  ..'lO,  **0,  .u.'i,  .qq,  taken  from  2!)/q  time-  the 
l\>  <,  di-'t ribut jon.  .\n  asleri'k  marks  the  sample  dkmj). 

Die  dual  problem  IroMi  Section  2  was  solvecl  .it  e.t«  h 
point  in  a  100  bv  100  grid,  using  the  IMSl  ( on  jticg.ii  e  gr.tdi 
enl  rout  me  /.X( ;  n  .  Of  the  10000  poi  nt  s  a  pproxi  iiiat  d  y  ii'//' 
of  them  were  withiti  the  (oiive.x  hull  of  the  point  (loud.  1  1m‘ 
comffutation  .started  at  the  (Kunf  ueare.st  tlie  sample  mean, 
and  |)roceede(l  in  a  discreie  <  oimterc  hn  kvMsi-  spu.il  for  (-ac  h 
jxuiit  after  t  he  first .  I  lie  st  at  1 1  ng  v.tl'M  of  \  j  ,|  i  he  o))!  i  iii  i/.it  ion 
was  taken  fr«»m  one  <if  the  j.eiehbonnL-,  points  m  the  urld  hu 
will'll  the  enipirieal  likelihood  h.ld  .drea.h  )..el;  -  ''M-pi-o-.j. 

I  l;e  fx  atj  t  chosen  w  a-*  the  I  j  l  h  the  )i  nde-''!  eMi  pi  in  a  1  like 

lihood.  If  all  sin  h  poin*--  !..id  .iii  empiiit.d  |ikelihoo<l  r.iiio 
near  /er(»  ill*'  point  w .o-  given  an  en. ini  n  .d  li k-  1 1 hooi  1  i  at  lo  ot 
/ero  I  hi-  sav  es  <  O'l'-  id'Ta  i  de  1  Mile  oil  t  -  id'  •  t  li<  i  <  m  \  i  X  h  11  i !  ‘  '! 
the  'lat  a.  I  lie  I  oniim  t  at  ion  t  ■  ><  ik  .i  ppi  <  >  \  i  n  i  a  t  >  1  \  .(  n  1 1  ii  m  '  on 

a  \  a\'>t  a  t  jofj  11  '  Mill  /I>V,|  \  ‘  u  r  a  k  -•*  a  t  )of) 

III  I  iL'uo-  J.  the  -ame  in  (' ’t  n  1 .1 1  lo'i  i-  pre-f-ni  .-.1 .  with 
t  he  (  ont •  111  r  iiovv  i  .i ken  (tore  .i  • '  ah-d  /  .  r 1 1  - 1 1 )  i i ii i  lot i  |c  a 
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Figure  1;  Empirical  Likelihood  Contours 
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tio  contours  assuming  a  bivariate  normal  distribution  with 
unknown  mean  and  variance.  The  50%  region  is  nearly  the 
same  as  for  the  empirical  likelihood.  More  extreme  regions 
remain  elliptical  while  those  of  the  empirical  likelihood  ratio 
method  tend  to  the  convex  hull  of  the  data. 

4. Extensions to Other Statistics. Theorem 1 for means extends by delta method arguments to statistics that are smooth functions of means. See theorem 2 of Owen (1987). Examples include the variance of X, which is a function of the mean of $(X, X^2)$, and the correlation between X and Y, which is a function of the mean of $(X, Y, X^2, Y^2, XY)$. Similarly coefficients of skewness, partial correlation and regression can be treated this way. DiCiccio, Hall and Romano (1988) show that the coverage error is of order $n^{-1}$ in this case under mild moment and derivative conditions. Extensions to M-estimates and to Fréchet differentiable statistical functionals are made in Owen (1987). Joint confidence regions for p such statistics can be based on a chisquare limit with p degrees of freedom provided there are no linear dependencies among the statistics.

Consider the variance. The algorithm of Section 2 can be used to compute the empirical likelihood of the pair $(\mu, \mu^2 + \sigma^2)$ as a candidate for the mean of $(X, X^2)$. For any fixed σ the resulting likelihood may be maximized over μ. Equivalently, we may compute the likelihood of $(\mu, \sigma^2)$ for the mean of $(X, (X - \mu)^2)$ and take as the likelihood of σ the maximum over μ. The latter prescription should be more stable numerically. Let

$$r(\mu, \sigma) = \sup \prod_{i=1}^{n} n w_i,$$

subject to the constraints

$$w_i \ge 0, \quad \sum w_i = 1, \quad \sum w_i X_i = \mu, \quad \sum w_i (X_i - \mu)^2 = \sigma^2$$

and abuse notation by letting

$$r(\sigma) = \sup_{\mu} r(\mu, \sigma).$$

The analysis in Section 2 allows us to write $w_i = w_i(\lambda)$ where $\lambda \in \mathbb{R}^2$ is the Lagrange multiplier and now

$$\log r(\sigma) = \sup_{\mu} \inf_{\lambda} \log r^*(\sigma, \mu, \lambda)$$

where

$$r^*(\sigma, \mu, \lambda) = \prod_{i=1}^{n} \left( 1 + \lambda_1 (X_i - \mu) + \lambda_2 ((X_i - \mu)^2 - \sigma^2) \right)^{-1}.$$

Figure 3: Potassium-Argon Dates
A nested optimization may be used to compute r(σ). The inner level of the optimization minimizes the likelihood over λ with μ and σ held fixed. The outer level maximizes the resulting minimum over μ for fixed σ. In the outer level it is convenient to know the derivative of the minimum with respect to μ. This may be determined analytically,

$$\frac{\partial}{\partial \mu} \inf_{\lambda} \log r^*(\sigma, \mu, \lambda) = n \lambda_1, \qquad (4.1)$$

where $\lambda_1$ is the first component of the minimizing λ. A generic function optimizer may try impossible values of (μ, σ) before finding the optimum. This is especially likely for extremely large or small values of σ. It thus helps to extend the domain of the empirical likelihood as described in Section 2, through the function log*. The alternative is to design an optimization method more specific to empirical likelihood. When the likelihood is extended, the more general version of (4.1) is

$$\frac{\partial}{\partial \mu} \inf_{\lambda} \log r^*(\sigma, \mu, \lambda) = n \lambda_1 \sum_{i} w_i(\lambda).$$

Larsen and Marx (1986, p. 332) give 19 estimated ages, in millions of years, of mineral samples collected in the Black Forest. The ages were estimated by Potassium-Argon dating. The variance of these measurements is of direct interest since it provides information on the precision of the dating method. A histogram of this data appears in Figure 3. The sample standard deviation is 27.1 million years, and a normal theory 95% confidence interval is 20.1 to 40 million years.

Figure  4;  Likelihood  Ratios 
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The  empirical  likelihood  ratio  was  calculated  for  values 
of the variance corresponding to standard deviations in the range from 1.5 to 51.5 million years in steps of half a million years. Computations were made for an increasing sequence of standard deviations starting near the maximum likelihood estimate, and for a decreasing sequence starting there. This way the final values from each step could be used as starting values for the next. It took two minutes to make 102 likelihood evaluations on a MicroVAX VAXstation II.

Figure 4 shows the empirical likelihood ratio function, together with the normal theory likelihood ratio function. The likelihood ratios are plotted against standard deviations in millions of years. The horizontal lines correspond to 90% and 95% empirical likelihood confidence intervals. Slightly different lines would be appropriate for the exact confidence regions based on a normal model. The empirical likelihood


ables. 

Larsen  and  Marx  (19S6  page  45(i)  give  15  pairs  of  ob¬ 
servations  relating  the  frequency  with  which  crickets  chirp  to 
the  temperature.  The  data  are  plotted  in  Figure  6.  The  fre¬ 
quency  is  measured  in  chirps  per  minute  and  the  temperature 
is  in  degrees  Fahrenheit.  The  sample  correlation  is  0.835.  The 
empirical  likelihood  ratio  function  is  plotted  in  Figure  7.  Also 
shown  is  the  normal  theory  profile  likelihood  ratio  function. 
The  empirical  curve  lies  above  the  normal  one.  Figure  8  is 
a  histogram  of  1000  bootstrap  re|>lirations  of  the  correlation. 
The  empirical  likelihood  ratio  curve  is  very  asymmetric,  so 
it  will  yield  inferences  quite  different  from  those  based  on  an 
estimated  standard  deviation  for  p.  The  shape  of  the  curve 
is  similar  to  the  bootstrap  histogram. 

Figure  6:  Temperature  vs.  Chirps/Minute 
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ratio  curve  has  a  shorter  right  tail  and  a  very  slightly  longer 
left  tail  than  the  normal  one.  It  is  surprising  how  clo.se  the 
two  curves  are.  The  shorter  right  tail  of  the  empirical  curve 
seems  natural  given  the  apparent  shortness  of  the  tails  in 
Figure  .3.  The  sample  kurtosis  is  0.02  if  one  uses  the  nor¬ 
mal  maximum  likelihood  estimate  of  ij~  as  in  Miller  (19.8(). 
p.  272),  and  —0.29  if  one  uses  the  unbiased  estimate  of  . 
Since  the  sample  maximum  is  .344  and  tho  minimum  is  243. 
the  largest  possible  standard  deviation  for  a  reweighted  ,sam 
j)le  is  .51.5.  The  algorithm  found  an  empirical  log  likelihoo<l 
of  —52.9  for  a  standard  deviation  of  51.  The  smallest  stan- 
dtird  deviation  for  which  a  im'auingful  solution  was  obtained 
was  .3.5  ami  the  corresponding  empirical  log  likelihood  was 
-51.8.  These  correspond  trt  |)ulative  values  in  excess  of 
100.  It  follows  that  for  any  confidenc'  level  of  practical  inter¬ 
est  the  empirical  interval  for  the  variance  can  be  computed 
from  this  dtita.  For  standard  deviations  outside  (3.5.51)  the 
modilicatiotis  to  the  logarithm  that  make  it  possible  to  use 
generic  optimizers  lead  to  convergence  to  solutions  for  which 
tl]i'  weights  ic,  sum  to  less  than  I,  It  made  lh<' compillalii.ns 
more  stable  to  divide  t  hi'  ages  by  100  before  computing  the 
intervals. 

I'he  normal  theory  curve  is  extict  if  the  observations  are 
tiorrmilly  distributed  and  has  a  large  sample  justification  if 
the  kurtosis  of  the  metisuremetits  is  0.  The  empirical  likeli. 
hood  curve  has  a  large  sample  Justification  provided  that  the 
kurtosis  is  finite.  Figure  5  shows  a  hisiogram  of  1000  Iroot 
strap  rejrlicat ions  of  the  standard  deviation.  1  he  histogram 
has  .1  location  and  si  tile  comparable  to  those  of  the  likeli 
hood  rtitio  curves,  l  lie  right  tttil  of  the  bootsti.ip  histogrtiiii 
looks  tiiori'  like  the  I’liipiriitil  likelihood  ratio  curve  than  the 
normal  theory  one. 

.A  similar  nested  algorithm  works  for  the  correlation  />, 
1  he  inner  level  consists  of  finding  ilie  likelihoinl  of 
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Figure  7:  Likelihood  Ratios 
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I  he  oiiti’r  le\e|  rollsists  of  1 1 1  it  X  i  tl  1  i/ i  1 1  g  the  result  of  the  illlli’r 
level  OMT  rhoit  es  of  I  //,.//,,.  it  IT  ‘  ) .  Fsing  the  dil.ll  I'rolilem. 
the  inner  opt  i  mi/at  ion  is  o\et  5  \  a  riti  hies  a;.d  the  inter  is  o\er 
1  valitibles.  I  he  w  It  ol  e  i  oill  p  11 1  a  I  ion  isiloneie.era  1  dillien 
Slollal  grid  of  \idlies  lor  /'.  1  he  1  Villi.. Id's  ol  the  outer  op 

tinii/ation  must  olie\  soiii"  lonstraitils  to  be  v.ilid  moments 
If  til  her  I  hati  i  le’ik  whet  her  earh  t  ritil  point  of  i  he  out  it  opt  i 
ii.i/ation  is  possible,  it  is  easier  t'l  I'Xlend  the  inner  funitioi, 
•  IS  d*‘s(  rilieil  in  Seriion  2.  .\s  before  .m.dvtii  deriv.il  i\es  are 

.iSailtible  for  the  outer  opt  I  mi/, it  if  III  N  iirnerictil  p<  t  foi  m.iin  e 
is  improi-ed  liv  leiiiefing  and  si. ding  Imth  the  \  .iiid  V  v.iti 
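The bootstrap histograms referred to in this section (1000 replications of the standard deviation and of the correlation) can be produced along the following lines. This is only a minimal sketch: the paired data below are synthetic placeholders, not the Larsen and Marx measurements, and the function names `correlation` and `bootstrap_correlations` are illustrative assumptions rather than anything used by the author.

```python
import math
import random


def correlation(xs, ys):
    """Sample correlation of paired data; None if a resample is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    denom = math.sqrt(sxx * syy)
    if denom == 0.0:
        return None
    # Clamp to guard against rounding slightly outside [-1, 1].
    return max(-1.0, min(1.0, sxy / denom))


def bootstrap_correlations(xs, ys, replications=1000, seed=0):
    """Resample the pairs with replacement and recompute the correlation."""
    rng = random.Random(seed)
    n = len(xs)
    out = []
    while len(out) < replications:
        idx = [rng.randrange(n) for _ in range(n)]
        r = correlation([xs[i] for i in idx], [ys[i] for i in idx])
        if r is not None:  # skip degenerate resamples
            out.append(r)
    return out


# Synthetic placeholder data (15 pairs), NOT the cricket data from the text.
xs = [float(i) for i in range(15)]
ys = [2.0 * x + 5.0 * math.sin(3.0 * x) for x in xs]
reps = bootstrap_correlations(xs, ys)
```

A histogram of `reps` plays the role of Figure 8; replacing `correlation` by a standard deviation function gives the analogue of Figure 5.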






7. Acknowledgements. The author would like to
thank Nils Hjort, Michael Steele, Bradley Efron and Tom
DiCiccio for helpful comments. This research was supported
by National Science Foundation Grant DMS86-00235.
REFERENCES

DiCiccio, T.J., Hall, P.J. & Romano, J.P. (1988). "Comparison
of Parametric and Empirical Likelihood Functions".
Technical Report No. 291. Dept. of Statistics,
Stanford University.

Larsen, R.J. & Marx, M.L. (1986). An Introduction to Mathematical
Statistics and its Applications. Prentice-Hall,
Englewood Cliffs, New Jersey.

Miller, R.G. (1986). Beyond ANOVA, Basics of Applied Statistics.
New York: J. Wiley & Sons.

Owen, A.B. (1987). "Empirical Likelihood Ratio Confidence
Regions". Technical Report No. 28:t. Dept. of Statistics,
Stanford University.

Owen, A.B. (1988). Empirical Likelihood Ratio Confidence
Intervals For a Single Functional. Biometrika 75 No. 2.

Pshenichny, B.N. & Danilin, Yu.M. (1978). Numerical Methods
in Extremal Problems. Moscow: Mir Publishers.

Press, W.H., Flannery, B.P., Teukolsky, S.A. & Vetterling,
W.T. (1986). Numerical Recipes. Cambridge: Cambridge
University Press.

Rheinboldt, W.C. (1974). Methods for Solving Systems of
Nonlinear Equations. Conf. Series in Appl. Math.,
No. 14. Philadelphia: SIAM.

Wilks, S.S. (1938). The Large-Sample Distribution of the
Likelihood Ratio for Testing Composite Hypotheses.
Ann. Math. Statist. 9, 60-62.




COMPUTING  EXTENDED  MAXIMUM  LIKELIHOOD  ESTIMATES  FOR 
LINEAR  PARAMETER  MODELS 

Douglas B. Clarkson, IMSL, Inc. and Robert I. Jennrich, UCLA


Summary 

Methods are given for computing extended maximum
likelihood estimates in which one or more parameter estimates
are infinite at the supremum of the likelihood.
The results are given for a broad class of regression-like
models based on independent observations with linearly
related parameters including, in particular, the generalized
linear models. The estimation consists of two
steps: a linear programming step to identify the infinite
components, and a more conventional function optimization
step to optimize the remaining finite components.
Provision is made for nuisance parameters. Two algorithms
are presented and examples illustrating their use
are given.


1.  Introduction 

When a few elements of a vector of estimates are infinite
at the supremum of a likelihood, standard computational
algorithms fail since convergence to infinity is
not possible. Haberman (1974, appendix B) defines and
gives an example of such estimates for frequency data,
but he gives no computational algorithm. He calls the
estimates obtained extended maximum likelihood estimates.
Because these estimates contain infinite values,
they do not exist in the usual sense and other authors
(Silvapulle and Burridge, 1986; Anderson and Albert,
1984) have felt that detecting the presence of infinite
estimates is sufficient. The algorithms they developed
check for "existence" (i.e., finiteness) of the estimates,
and leave the user to respond to the situation by adjusting
the model, obtaining additional data, or halting the
analysis. Following Haberman (1974), we feel that information
such as the optimal log-likelihood and parameter
estimates associated with each observation is useful
and can be obtained by computing the extended maximum
likelihood estimates. We give efficient computer
algorithms for such computations for models involving
linearly related parameters, including the "generalized
linear models" of Nelder and Wedderburn (1972), linear
models in survival analysis, and censored regression
models.

The next section discusses extended maximum likelihood
estimation. Sections 3 and 5 give theorems related
to the computation of these estimates, Sections 4 and 6
give the computational algorithms, and Section 7 gives
some examples.


2.  Extended  Maximum  Likelihood  Estimates 

Let $\Theta$ be a subset of $\Re^p$ and for $\theta \in \Theta$ let $\ell(\theta)$ be a
log-likelihood corresponding to some observed data. Let
$\bar{\Re}$ denote the extended real line $[-\infty, \infty]$. We say that
$\hat\theta \in \bar{\Re}^p$ is an extended maximum likelihood estimate if $\hat\theta$
is a limit point of $\Theta$ and for every sequence $\theta_n \in \Theta$ that
converges to $\hat\theta$,

$$\lim_{n \to \infty} \ell(\theta_n) = \sup_{\theta \in \Theta} \ell(\theta).$$

This is equivalent to saying that $\ell(\theta)$ can be continuously
extended so that its domain includes $\hat\theta$ and that $\hat\theta$ is a
maximum likelihood estimate over the extended domain.
Note that an ordinary maximum likelihood estimate is
an extended maximum likelihood estimate.

Consider the binomial log-likelihood

$$\ell(\pi) = c + y \log \pi + (n - y) \log(1 - \pi), \quad 0 < \pi < 1.$$

If $y = 0$, then $\hat\pi = 0$ is not in the domain $0 < \pi < 1$
of $\ell$, but is an extended maximum likelihood estimate.
Similarly, under the logistic parameterization

$$\pi = \frac{e^{\eta}}{1 + e^{\eta}},$$

$\hat\eta = -\infty$ is an extended maximum likelihood estimate.

Consider the normal distribution $N(\eta, 1)$ and a censored
observation $y > y_0$. The log-likelihood is

$$\ell(\eta) = c + \log \int_{y_0}^{\infty} e^{-(y - \eta)^2/2} \, dy, \quad -\infty < \eta < \infty.$$

Here $\hat\eta = \infty$ is an extended maximum likelihood estimate.
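For the binomial example above with y = 0, the way the supremum is approached only in the limit can be checked numerically. The following is a minimal sketch; the function name `binomial_loglik` and the choice n = 5 are illustrative assumptions, and the additive constant c is dropped.

```python
import math


def binomial_loglik(eta, y, n):
    """y*log(pi) + (n - y)*log(1 - pi) with pi = exp(eta) / (1 + exp(eta)).

    Algebraically this equals y*eta - n*log(1 + exp(eta)).
    """
    # Stable evaluation of log(1 + exp(eta)) for large |eta|.
    log1pexp = max(eta, 0.0) + math.log1p(math.exp(-abs(eta)))
    return y * eta - n * log1pexp


# With y = 0 the log-likelihood increases toward its supremum 0 as eta
# decreases, but no finite eta attains it.
values = [binomial_loglik(eta, y=0, n=5) for eta in (0.0, -2.0, -5.0, -10.0, -20.0)]
```

Each entry of `values` is strictly negative and larger than the previous one, which is the numerical face of the statement that the estimate of $\eta$ is $-\infty$.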

Most of our discussion will deal with linear parameter
models. By this we mean models involving independent
observations $y_1, \ldots, y_n$ with probability density or mass
functions of the form

$$f_i(y_i \mid \eta_i), \quad \eta_i = x_i \beta. \tag{2}$$

Such models are called "models with a linear part" by
Stirling (1984), but this terminology is by no means standard.
By either name, these models include the generalized
linear models of Nelder and Wedderburn (1972) and,
as Stirling points out, many other models as well, including,
for example, the censored normal example above.

The parameter vector $\beta$ in (2) is common to all of
the observations, and is related to the scalar parameter
$\eta_i$ for the $i$-th observation by the vector $x_i$ of covariate or
design values. The $\eta_i$ are clearly linearly related and may
be displayed in vector form as

$$\eta = X\beta,$$

where $X$ contains the $x_i$ as rows. Let $\mathcal{M}$ denote the
column space of $X$. Then $\eta$ ranges over $\mathcal{M}$ which, for
the purpose of our development, is an arbitrary subspace
of $\Re^n$.


3.  Results  for  Linear  Parameter  Models 


Linear parameter models have likelihoods whose logarithms
have the additive form

$$\ell(\eta) = \sum_{i=1}^{n} \ell_i(\eta_i), \quad \eta \in \mathcal{M}, \tag{3}$$

where $\mathcal{M}$ is a subspace of $\Re^n$ and the functions $\ell_i$ in the
sum are defined on the real line $\Re$.

Assuming it exists, let

$$\ell_i(\infty) = \lim_{t \to \infty} \ell_i(t)$$

and define $\ell_i(-\infty)$ similarly. We will use the following
additional assumptions about the functions $\ell_i$:


Assumptions: For each $i = 1, \ldots, n$


1. $\ell_i$ is continuous and bounded above on $\Re$.

2. $\ell_i(\infty) = -\infty$.

3. $\ell_i(-\infty) = -\infty$ or $\ell_i(-\infty) = \sup_t \ell_i(t)$.



Under these assumptions the $\ell_i$ have one of the two forms
displayed in Figure 1.

Figure 1. Assumed Forms for the Likelihood Terms (a. doubly descending; b. singly descending)

We say those $\ell_i$ of form (a) are doubly descending and
those of form (b) are singly descending. The choice of
right descension for singly descending $\ell_i$ is arbitrary; a
left descending $\ell_i$ can be made right descending by
reparameterizing.

In the case of the binomial example with the logistic
parameterization, there is only one $\ell_i$ and it is right
descending when $y = 0$, left descending when $y = n$, and
doubly descending otherwise. The censored normal $\ell_i$ is
left descending. It, and the binomial $\ell_i$ when $y = n$,
would have to be reparameterized to satisfy assumptions
(2) and (3).
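The three cases for the binomial term can be illustrated numerically. This is a rough sketch, not a proof: the helper names `binomial_term` and `classify`, and the cut-offs `big` and `drop`, are assumptions chosen for illustration. A term is called descending on a side if its value far out on that side lies well below its value at 0.

```python
import math


def binomial_term(eta, y, n):
    """Binomial log-likelihood term under the logistic parameterization."""
    log1pexp = max(eta, 0.0) + math.log1p(math.exp(-abs(eta)))
    return y * eta - n * log1pexp


def classify(term, big=50.0, drop=10.0):
    """Heuristic label obtained by comparing term(+-big) with term(0)."""
    falls_right = term(big) < term(0.0) - drop
    falls_left = term(-big) < term(0.0) - drop
    if falls_right and falls_left:
        return "doubly descending"
    if falls_right:
        return "right descending"
    if falls_left:
        return "left descending"
    return "not descending"


n = 5
labels = {y: classify(lambda eta, y=y: binomial_term(eta, y, n)) for y in (0, 3, 5)}
```

With n = 5 the labels come out as right descending for y = 0, doubly descending for y = 3, and left descending for y = 5, matching the description above. Replacing $\eta$ by $-\eta$ in the y = n term is the reparameterization that makes it right descending.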

Given assumption (3), we make the following:

Definition 1: Choose $\eta^* \in \mathcal{M}$ so $\eta^* \le 0$, $\eta_i^* = 0$
whenever $\ell_i(-\infty) = -\infty$, and $\eta^*$ has a maximum number
of negative components. Let $\mathcal{D} = \{i : \eta_i^* = 0\}$.

By considering convex combinations, it is easy to see
that $\eta^*$ exists and $\mathcal{D}$ is uniquely defined. The following
theorem will be useful to identify the finite part of an
extended maximum likelihood estimate for $\eta$ in (3).

Theorem 1: Under assumptions (1), (2), and (3), if
$\mathcal{D}$ is not empty

$$\ell^*(\eta) = \sum_{i \in \mathcal{D}} \ell_i(\eta_i), \quad \eta \in \mathcal{M} \tag{4}$$

has a maximum.

Proof: Assume $\ell^*$ does not have a maximum in $\mathcal{M}$.
Choose a sequence $\eta_n \in \mathcal{M}$ so that

$$\ell^*(\eta_n) \to \sup \ell^*. \tag{5}$$

Choose a subsequence of $\eta_n$ so that

$$\eta_n = \rho_n d_n, \quad d_n \to d,$$

where $\rho_n$ and $d_n$ are the length and
direction of $\eta_n$. Since $\ell^*$ has no maximum, $\rho_n \to \infty$.

If $d_i > 0$ for some $i \in \mathcal{D}$, $\ell_i(\eta_{n,i}) \to -\infty$ by assumption
(2). Since each $\ell_i$ is bounded above by assumption (1),
$\ell^*(\eta_n) \to -\infty$. This contradicts (5), and hence $d_i \le 0$ for
all $i \in \mathcal{D}$. Assume $d_i < 0$ and $\ell_i(-\infty) = -\infty$ for some
$i \in \mathcal{D}$. Again, $\ell^*(\eta_n) \to -\infty$. This contradiction
implies $d_i = 0$ whenever $\ell_i(-\infty) = -\infty$ and $i \in \mathcal{D}$. Since
by the definition of $\mathcal{D}$, $\ell_i(-\infty) = -\infty$ implies $i \in \mathcal{D}$,
$d_i = 0$ whenever $\ell_i(-\infty) = -\infty$.

Assume $d_i = 0$ for all $i \in \mathcal{D}$. Then $\ell^*(0) = \ell^*(\eta_n)$
and by (5), $\ell^*(0) = \sup_{\eta \in \mathcal{M}} \ell^*(\eta)$. This contradicts the
assumption that $\ell^*$ has no maximum on $\mathcal{M}$. Thus $d_i < 0$
for at least one $i \in \mathcal{D}$.

Let $\eta^*$ be any vector that defines $\mathcal{D}$ in Definition 1,
and let

$$\tilde{d} = d + \alpha \eta^*.$$

Then for $\alpha$ sufficiently large, $\tilde{d} \le 0$, and $\tilde{d}_i = 0$ whenever
$\ell_i(-\infty) = -\infty$. Note, however, that $\tilde{d}$ has at least one
more negative component than $\eta^*$. This contradicts the
definition of $\eta^*$ and implies that $\ell^*$ has a maximum on
$\mathcal{M}$.

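The following theorem will tell how to construct extended
maximum likelihood estimates.

Theorem 2: Under assumptions (1), (2), and (3),

$$\sup_{\eta \in \mathcal{M}} \ell(\eta) = \sup_{\eta \in \mathcal{M}} \sum_{i \in \mathcal{D}} \ell_i(\eta_i) + \sum_{i \in \mathcal{D}^c} \ell_i(-\infty). \tag{6}$$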

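Proof: Let $\tilde\eta \in \mathcal{M}$ maximize $\ell^*$ as defined in Theorem
1 and let $\eta^*$ be as defined in Definition 1. Then for any
$\alpha > 0$,

$$\sup \ell \ge \ell(\tilde\eta + \alpha\eta^*) = \sup \ell^* + \sum_{i \in \mathcal{D}^c} \ell_i(\tilde\eta_i + \alpha\eta_i^*).$$

For $i \in \mathcal{D}^c$, as $\alpha \to \infty$, $\ell_i(\tilde\eta_i + \alpha\eta_i^*) \to \ell_i(-\infty)$. Since
$\ell_i(-\infty) = \sup \ell_i$ for $i \in \mathcal{D}^c$,

$$\sup \ell \ge \sup \ell^* + \sum_{i \in \mathcal{D}^c} \sup \ell_i.$$

The opposite inequality is easy.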

To construct an extended maximum likelihood estimate
under assumptions (1), (2), and (3), find $\mathcal{D}$ as given
in Definition 1. Then find a maximizer $\tilde\eta \in \mathcal{M}$ of $\ell^*$ as
defined by (4). This exists by Theorem 1. Since under
these same assumptions $\sup \ell_i = \ell_i(-\infty)$ for each $i \in \mathcal{D}^c$,
it follows from Theorem 2 that

$$\hat\eta_i = \begin{cases} \tilde\eta_i & \text{if } i \in \mathcal{D}, \\ -\infty & \text{if } i \in \mathcal{D}^c \end{cases}$$

is an extended maximum likelihood estimate. Thus, in
theory at least, one simply needs to find $\mathcal{D}$ and maximize
$\ell^*$.


The next section will show how to construct $\mathcal{D}$ using
linear programming. This will be combined with an optimization
program to give the first of the algorithms of
Section 6. The second algorithm uses an initial run of the
optimization program and the following theorem to help
find $\mathcal{D}$.

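Theorem 3: If under assumptions (1), (2), and (3)
each singly descending $\ell_i$ has the property that

$$\ell_i(-\infty) > \ell_i(t) \tag{7}$$

for all $t$ and if

$$\ell_S(\eta) = \sum_{i \in S} \ell_i(\eta_i)$$

has a maximum at $\tilde\eta \in \mathcal{M}$, then $S \subset \mathcal{D}$.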

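Proof: Assume $S$ is not a subset of $\mathcal{D}$. Using $\eta^*$ from
Definition 1, consider

$$\ell_S(\tilde\eta + \rho\eta^*) = \sum_{i \in S \cap \mathcal{D}} \ell_i(\tilde\eta_i) + \sum_{i \in S \cap \mathcal{D}^c} \ell_i(\tilde\eta_i + \rho\eta_i^*).$$

For each $\ell_i$ in the second sum

$$\ell_i(\tilde\eta_i + \rho\eta_i^*) \to \ell_i(-\infty)$$

as $\rho \to \infty$. Since each of these $\ell_i$ is singly descending
by Definition 1, it follows from (7) that for some $\rho$ sufficiently
large,

$$\ell_i(\tilde\eta_i + \rho\eta_i^*) > \ell_i(\tilde\eta_i)$$

for each such $\ell_i$, and hence that $\ell_S(\tilde\eta + \rho\eta^*) > \ell_S(\tilde\eta)$.

This contradicts the assumption that $\tilde\eta$ maximizes $\ell_S$.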


4. Finding $\mathcal{D}$ by Linear Programming


Recall the parameterization $\eta = X\beta$ introduced in
Section 2. Let $X_1$ denote the submatrix of $X$ containing
the rows that correspond to the doubly descending $\ell_i$
(and possibly additional rows with indices known to be
in $\mathcal{D}$) and let $X_2$ denote the submatrix obtained from the
remaining rows. From Definition 1 we seek a $\beta$ such that

$$X_1 \beta = 0, \qquad X_2 \beta \le 0, \tag{8}$$

and $X_2 \beta$ has as many negative components as possible. If
the columns of $X_1$ are linearly independent, $\beta = 0$ is the
only solution to (8) and $X_2 \beta$ can have no negative components.
Then $\mathcal{D}$ is the complete set of integers from 1
to $n$. Otherwise, using appropriate row transformations,
(8) can be put in the equivalent form

$$X_{11}\beta_1 + X_{12}\beta_2 = 0, \qquad X_{22}\beta_2 \le 0, \tag{9}$$

where $X_{11}$ is a submatrix of $X_1$ with linearly independent
columns spanning the space of $X_1$. The problem of
finding $\mathcal{D}$ is reduced to finding a $\beta_2$ such that $X_{22}\beta_2 \le 0$
and $X_{22}\beta_2$ has as many negative components as possible.

This is equivalent to finding $\beta_2$ and $\delta$ so that

$$X_{22}\beta_2 + \delta = 0, \qquad \delta \ge 0, \tag{10}$$

and $\delta$ has as many positive components as possible. Using
row transformations to eliminate the "free variables"
$\beta_2$ as described by Luenberger (1984, page 13), (10) can
be written in the equivalent form

$$A\delta_1 + \delta_2 = 0, \qquad \delta_1, \delta_2 \ge 0, \tag{11}$$

where $\delta_1$ and $\delta_2$ are subvectors of $\delta$.

To find a $\delta$ with as many positive components as
possible, apply the simplex algorithm to maximize the
function $f(\delta) = \sum_i c_i \delta_i$ under the constraint (11), where
initially each $c_i = 1$. Iterate until convergence or until
all elements in the column about to enter the basis are
nonpositive. At this point there is a feasible solution to
(11) that has positive values for the variable about to enter
and for all basic variables corresponding to negative
values in the column about to enter. Rather than performing
the pivot, set the coefficients $c_i$ for all of these
variables to zero. The reduced coefficient for the variable
that was about to enter will now be zero. Continue
with the simplex algorithm in this modified manner until
it converges. Coefficients $c_i = 1$ at convergence identify
elements in $\mathcal{D}$. More specifically, $\mathcal{D}$ is identified by the
indices of the rows of $X_1$ and the indices of the rows of
$X_2$ corresponding to $c_i = 1$ at convergence.


5. Nuisance Parameters

For applications it is useful to deal with models that
are a little more general than the linear parameter models
of the previous section. Consider a log likelihood of the
form

$$\ell(\eta, \phi) = \sum_{i=1}^{n} \ell_i(\eta_i, \phi), \tag{12}$$

where $\eta \in \mathcal{M}$ and $\phi \in \Phi$ is a set of nuisance parameters.
If for each $\phi \in \Phi$ the functions $\ell_i(t, \phi)$ of $t$ satisfy assumptions
(1), (2), and (3), and if the set of singly descending
$\ell_i(\cdot, \phi)$ remains the same for all values of $\phi$, then by
Theorem 2, $\ell(\eta, \phi)$ and

$$\sum_{i \in \mathcal{D}} \ell_i(\eta_i, \phi) + \sum_{i \in \mathcal{D}^c} \ell_i(-\infty, \phi) \tag{13}$$

have the same supremum. If $(\tilde\eta, \tilde\phi)$ maximizes (13) and

$$\hat\eta_i = \begin{cases} \tilde\eta_i & i \in \mathcal{D}, \\ -\infty & i \in \mathcal{D}^c, \end{cases}$$

then $(\hat\eta, \tilde\phi)$ is an extended maximum likelihood estimate.


6. Two Algorithms

Let $T$ denote the set of indices $i$ for which $\ell_i$ is doubly
descending. The simplest algorithm consists of two steps.
The first is to solve the linear programming problem in
Section 4 with $X_1$ containing the rows of $X$ with indices
in $T$. This gives $\mathcal{D}$. A general optimization algorithm
such as Fisher scoring or Newton-Raphson is then applied
to maximize (13).

An alternative approach that simplifies the linear programming
step, and avoids it entirely when an extended
estimate is not required, proceeds as follows: Apply the
optimizing algorithm to the complete $\ell(\eta, \phi)$ as given by
(12). If during the iteration a component $\eta_i$ of $\eta$ appears
to be too negative and if $i \notin T$, eliminate the term
$\ell_i(\eta_i, \phi)$ from (12). Continue in this manner until no further
terms are eliminated. Let $S$ denote the indices of the
terms which have not been eliminated. If the algorithm
converges to a maximum of

$$\sum_{i \in S} \ell_i(\eta_i, \phi) \tag{14}$$

at $\tilde\eta \in \mathcal{M}$ and $\tilde\phi \in \Phi$, then by Theorem 3, $S \subset \mathcal{D}$ and
the linear programming in Section 4 can begin
with $X_1$ corresponding to the rows of $X$ with indices in
$S$. From this point the algorithm proceeds as in the first
algorithm with $T$ replaced by $S$.

In practice one may or may not know one's algorithm
produces a global optimum of (14). If, for example, one is
only sure it produces a local maximum or perhaps a stationary
point, then a formal appeal to Theorem 3 is not
possible, but the second algorithm often seems to work
nevertheless and may be usefully viewed as a heuristic
algorithm.

There are often very practical methods for determining
when an $\eta_i$ is too negative. For example, in the logistic
model, and in many other models, $\eta_i$ is too negative
when $\pi_i$ is very close to 0.
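The elimination step of the second algorithm can be sketched for the four binomial cells that appear as Table 1 in Section 7. This is an illustration under stated assumptions, not the authors' FORTRAN implementation: plain gradient ascent stands in for the quasi-Newton routine, terms with y = n are handled by symmetry rather than by explicit reparameterization, and the names `prob`, `heuristic_finite_set`, and the threshold `eps` are hypothetical.

```python
import math

# Design rows for (mu, alpha, beta), successes y and trials n, one row per cell.
X = [(1, 1, 1), (1, 1, -1), (1, -1, 1), (1, -1, -1)]
Y = [5, 3, 1, 0]
N = [5, 5, 5, 5]


def prob(eta):
    """Numerically stable logistic function."""
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    e = math.exp(eta)
    return e / (1.0 + e)


def heuristic_finite_set(X, Y, N, eps=1e-3, step=0.1, tol=1e-8, max_iter=100000):
    """Gradient ascent that drops singly descending terms drifting to infinity."""
    beta = [0.0] * len(X[0])
    active = set(range(len(X)))
    for _ in range(max_iter):
        grad = [0.0] * len(beta)
        for i in sorted(active):
            eta = sum(x * b for x, b in zip(X[i], beta))
            p = prob(eta)
            # Only terms outside T (y = 0 or y = n) may be eliminated.
            too_far = (Y[i] == 0 and p < eps) or (Y[i] == N[i] and p > 1.0 - eps)
            if too_far:
                active.discard(i)
                continue
            residual = Y[i] - N[i] * p
            for j, x in enumerate(X[i]):
                grad[j] += x * residual
        if math.sqrt(sum(g * g for g in grad)) < tol:
            break
        beta = [b + step * g for b, g in zip(beta, grad)]
    return sorted(i + 1 for i in active), beta


S, beta_hat = heuristic_finite_set(X, Y, N)
```

Here `S` holds the 1-based indices of the terms that were never eliminated. As the text notes, such a run only suggests $S \subset \mathcal{D}$; the linear programming step is still needed to confirm the set.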

7. Examples

It is easy to show that the assumptions of sections 3
and 5 hold for interval data in linear parameter models
based upon binomial, negative binomial, logarithmic,
log normal, exponential, Weibull, and extreme value
distributions. The assumptions do not hold for the Cox
proportional hazards model, which can also suffer from
the problems of infinite estimates. The algorithms may
still yield extended maximum likelihood estimates, however.
Further research is needed.

FORTRAN  subroutines  for  the  algorithms  have  been 
implemented  and  will  eventually  be  available  in  the  IMSL 
(1987)  libraries.  To  illustrate  the  advantages  of  each  of 
the  algorithms,  consider  the  data  in  Table  1.  A  two-way 
additive  factorial  logistic  model  with  standard  restric¬ 
tions  was  fit  to  the  data. 

Table 1: Logistic Regression, Binomial Data

Obs.   Cell    $\eta$                     y   n
1      (1,1)   $\mu + \alpha + \beta$     5   5
2      (1,2)   $\mu + \alpha - \beta$     3   5
3      (2,1)   $\mu - \alpha + \beta$     1   5
4      (2,2)   $\mu - \alpha - \beta$     0   5

With algorithm 1, observations 2 and 3 were identified
as members of the set $\mathcal{D}$ in the linear programming step,
and a quasi-Newton algorithm then required 3 iterations
to converge to the optimal likelihood based upon these
two observations. For algorithm 2 the quasi-Newton routine
required 8 iterations and yielded observations 2 and
3 as members of the set $S$. The linear programming algorithm
then verified that observations 2 and 3 were the
only elements in $\mathcal{D}$. The estimated coefficients $\alpha$ and
$\beta$ were different in the two algorithms, but this is to be
expected since the estimated coefficients are not unique
when infinite estimates occur. Also as expected, the predicted
$\eta_i$ and probabilities in each cell were identical, as
were the optimal log-likelihoods.

A second example illustrating why algorithm 2 might
be preferred uses the same data as in Table 1, but tabulated
as Bernoulli trials. This form for the data is given
in Table 2. Under the column "Freq." is the number of
observations for which the outcome applies. Thus there
were 3 trials in which a "1" (success) was observed in cell
(1,2), while 2 trials in this cell were "0" (failures).

Table 2: Logistic Regression, Bernoulli Data

Obs.   Cell    $\eta$                     Freq.   y
1      (1,1)   $\mu + \alpha + \beta$     5       1
2      (1,2)   $\mu + \alpha - \beta$     3       1
3      (1,2)   $\mu + \alpha - \beta$     2       0
4      (2,1)   $\mu - \alpha + \beta$     1       1
5      (2,1)   $\mu - \alpha + \beta$     4       0
6      (2,2)   $\mu - \alpha - \beta$     5       0

In algorithm 1 no observations were initially identified
as members of the set $T$, and the linear programming,
using all observations, selected observations 1 and 6 as
observations with infinite $\eta_i$. The optimization routines
required 3 iterations to converge. For algorithm 2 the
optimization routine required 7 iterations and identified
observations 1 and 6 as observations with potentially infinite
$\eta_i$. The linear programming verified that the $\eta_i$
for these observations were infinite. Because algorithm
1 must first solve a linear programming problem based
upon all of the data, while algorithm 2 converged directly
to the final solution, which was then verified by a
smaller linear programming problem, algorithm 2 would
probably be preferred. Obviously, this becomes more important
as the size of the data set increases.
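For the Bernoulli layout of Table 2, a candidate direction in the sense of Definition 1 can be checked directly. This is a hedged sketch: the function name `infinite_observations` is hypothetical, and the check only confirms that one proposed $\beta$ is feasible and reports which components are negative. It does not establish that the number of negative components is maximal, which is what the linear program is for.

```python
# One entry per row of Table 2: design row for (mu, alpha, beta) and outcome y.
ROWS = [
    ((1, 1, 1), 1),
    ((1, 1, -1), 1),
    ((1, 1, -1), 0),
    ((1, -1, 1), 1),
    ((1, -1, 1), 0),
    ((1, -1, -1), 0),
]


def infinite_observations(rows, beta):
    """Return 1-based indices with a negative adjusted eta, or None if infeasible."""
    negatives = []
    for idx, (x, y) in enumerate(rows, start=1):
        eta = sum(xj * bj for xj, bj in zip(x, beta))
        # A Bernoulli term with y = 1 is left descending; flipping the sign
        # of eta is the reparameterization that makes it right descending.
        adjusted = -eta if y == 1 else eta
        if adjusted > 0:
            return None  # violates the requirement eta* <= 0
        if adjusted < 0:
            negatives.append(idx)
    return negatives


found = infinite_observations(ROWS, (0, 1, 1))
rejected = infinite_observations(ROWS, (1, 0, 0))
```

The direction (0, 1, 1) is feasible and flags observations 1 and 6, in agreement with the result reported above, while (1, 0, 0) is rejected because observations 2 and 3 share a cell but have opposite outcomes.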

8.  Conclusions 

Some  algorithms  for  computing  extended  maxinium 
likelilRKid  estimates,  theorems  justifying  their  use,  and 
extunples  illustratitig  the  jx'rformance  of  the  tilgorit Inns 
have  been  presentetl.  The  theorems  have  been  gi\  en  fir 
linetir  ptiramefer  models,  but  the  algorithms  nitty  finil 

ai)phc:ition  iti  a  tuore  getieral  context.  Of  the  two  tdgo 
rithtns  pre.sented,  algorithm  1  requires  fev.'T  tissutiiptions 
tind  thtis  tatty  be  more  robust.  .Algoritliin  2.  however,  is 
often  more  cotivenii'nt  cotnituttitionally.  and  may  thus  be 
jireferred  where  ttpplicable.  Becteise  infinite  estiniatrs 
tire  protKibly  tint  too  comtnon,  it  would  seem  reasonable 
to  apply  the  tdgorithms  discussed  here  only  after  an  al- 
gorithtn  for  finite  estitnates  had  failed. 
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SIMULTANEOUS  CONFIDENCE  INTERVALS  IN  THE  GENERAL  LINEAR  MODEL 
Jason  C.  Hsu,  The  Ohio  State  University 


Abstract 

Consider the general linear model (GLM) $Y = X\beta + \epsilon$.
Suppose $\beta_1, \ldots, \beta_k$ ($k \le p$) are of interest; $\beta_1, \ldots, \beta_k$ may
be treatment contrasts in an ANOVA setting, or regression
coefficients in a response surface setting. Computing the
coverage probability of simultaneous confidence intervals for
$\beta_1, \ldots, \beta_k$ by iterated (k+1)-dimensional integration is
impractical for all but the smallest data sets. We propose to
approximate the probability as a mixture of products of
univariate normal probabilities so that the number of
functional evaluations becomes linear in k. The performance
of this approximation is demonstrated in a variety of
settings.

1. Simultaneous Statistical Inference and the von Neumann Bottleneck

Consider the general linear model (GLM) $Y = X\beta + \epsilon$,
where $Y_{N \times 1}$ is the vector of observations, $X_{N \times p}$ is a known
design matrix, $\beta = (\beta_1, \ldots, \beta_p)'$ is the vector of parameters,
and $\epsilon_{N \times 1}$ is a vector of iid Normal$(0, \sigma^2)$ errors with $\sigma^2$
unknown. Suppose $\beta_1, \ldots, \beta_k$ ($k \le p$) are of interest and
are estimable; $\beta_1, \ldots, \beta_k$ may be treatment contrasts in an
ANOVA setting, or regression coefficients in a response
surface setting. Let $\hat\beta = (\hat\beta_1, \ldots, \hat\beta_k)'$ denote the BLUE
(best linear unbiased estimator) of $(\beta_1, \ldots, \beta_k)'$. Then
$(\hat\beta_1, \ldots, \hat\beta_k)'$ is multivariate normal with mean $(\beta_1, \ldots, \beta_k)'$ and
variance-covariance $\sigma^2 V$ where $V$ is the $k \times k$ sub-matrix of
the generalized inverse of $X'X$ corresponding to $\beta_1, \ldots, \beta_k$.
Assume $V$ is non-singular and let $s^2$ = MSE denote the
usual estimator of $\sigma^2$; $\nu s^2/\sigma^2$ has a $\chi^2$ distribution with $\nu = N - p$
degrees of freedom.

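To give two-sided simultaneous confidence intervals for
$\beta_1, \ldots, \beta_k$ with exact coverage probability $1 - \alpha$

$$P\{\hat\beta_i - |q| s \sqrt{v_{ii}} < \beta_i < \hat\beta_i + |q| s \sqrt{v_{ii}} \text{ for } i = 1, \ldots, k\} = 1 - \alpha$$

we need the quantile $|q|$ such that

$$P\Big(\max_{1 \le i \le k} |\hat\beta_i - \beta_i| / (s\sqrt{v_{ii}}) < |q|\Big) = 1 - \alpha. \tag{1}$$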

To solve for the quantile $|q|$, the probability
$P(\max_i |\hat\beta_i - \beta_i| / (s\sqrt{v_{ii}}) < |q|)$ has to be computed for candidates $|q|$,
which involves (k+1)-dimensional integration if one naively
integrates over $s$ and $\hat\beta_1, \ldots, \hat\beta_k$ in turn. If m-point
univariate Gaussian quadrature is performed iteratively, then
roughly $2m^{k+1}$ evaluations of the univariate normal
distribution function $\Phi$ or equivalent (e.g. Schervish 1984)
are required. Thus for all but the smallest k, the von
Neumann bottleneck prevents the computation of this iterated
integral from being practical for interactive statistical data
analysis.

2.  Existing  Methods 

Traditionally, this "curse of dimensionality" is
sidestepped by replacing the multivariate normal computation
by computations involving individual $|\hat\beta_i - \beta_i|$ (using the
Bonferroni inequality or Sidak's inequality) or computations
involving pairs of $|\hat\beta_i - \beta_i|$ and $|\hat\beta_j - \beta_j|$ (using the Hunter-
Worsley inequality). As these procedures are based on
conservative probabilistic inequalities, the resulting
simultaneous confidence intervals can be much wider than
necessary. The projection method of Scheffe-Working-
Hotelling avoids integration altogether by obtaining a
$100(1-\alpha)\%$ confidence ellipsoid for $\beta$ based on the F distribution,
then projecting the ellipsoid onto the $\beta_1, \ldots, \beta_k$ axes.
These projected confidence intervals for $\beta_1, \ldots, \beta_k$ tend to
be even wider than the probabilistic inequality confidence
intervals (see Fuchs and Sampson, 1987, and examples
below).

Though not stated in SAS User's Guide: Statistics (1985,
p. 448), the MEANS option of PROC GLM in SAS® for
multiple comparisons ignores any covariates $\beta_{k+1}, \ldots, \beta_p$ in
the user's model when estimating $\beta_1, \ldots, \beta_k$ (but not in
estimating MSE). Clearly, conclusions reached without
taking significant covariates into account can be totally
misleading (see Example 6.1 below). Unsuspecting users
have analyzed and published scientific findings based on this
option of PROC GLM, stating incorrectly that covariates
$\beta_{k+1}, \ldots, \beta_p$ have been adjusted for (e.g. Thatcher, Walker,
and Giudice, 1987).

3.  Proposed  Algorithm 

We propose to approximate the probability in (1) by 2-dimensional
mixtures of products of $k$ univariate normal
interval probabilities

$$\int_0^{\infty} \int_{-\infty}^{+\infty} \prod_{i=1}^{k} \left[ \Phi\!\left(\frac{\sqrt{c}\,\lambda_i z + |q| s}{(1 - c\lambda_i^2)^{1/2}}\right) - \Phi\!\left(\frac{\sqrt{c}\,\lambda_i z - |q| s}{(1 - c\lambda_i^2)^{1/2}}\right) \right] d\Phi(z)\, dF(s) \tag{2}$$

where $c = \pm 1$, $\Phi$ is the standard normal distribution, $F$ is the
distribution of $s/\sigma$, and $\lambda_1, \ldots, \lambda_k$ are constants that depend
on $X$. If again $m$-point quadrature is employed, then (2)
requires $2km^2$ evaluations of $\Phi$, which is much less than the
$2m^{k+1}$ evaluations required for iterated multivariate
integration. Note also that $2km^2$ grows linearly (as opposed
to exponentially for $2m^{k+1}$) in model size $k$. Thus, using
this approximation, simultaneous confidence intervals in
GLM can be given in an interactive environment.

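If the correlation matrix $R$ of $\hat\beta$ satisfies

$$R = (\rho_{ij}) = \begin{pmatrix} 1 - c\lambda_1^2 & & 0 \\ & \ddots & \\ 0 & & 1 - c\lambda_k^2 \end{pmatrix} + c \begin{pmatrix} \lambda_1 \\ \vdots \\ \lambda_k \end{pmatrix} (\lambda_1 \ \cdots \ \lambda_k) \tag{3}$$

with $c = +1$ (called structure-$l$, see Tong, 1979), then

$$\begin{pmatrix} (\hat\beta_1 - \beta_1)/\sigma\sqrt{v_{11}} \\ \vdots \\ (\hat\beta_k - \beta_k)/\sigma\sqrt{v_{kk}} \end{pmatrix} \overset{d}{=} \begin{pmatrix} \sqrt{1 - \lambda_1^2}\, Z_1 \\ \vdots \\ \sqrt{1 - \lambda_k^2}\, Z_k \end{pmatrix} + \begin{pmatrix} \lambda_1 \\ \vdots \\ \lambda_k \end{pmatrix} Z_0 \tag{4}$$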

where $Z_1, \ldots, Z_k, Z_0$ are iid N(0,1) random variables and $\overset{d}{=}$
denotes equality in distribution, and it is well known (e.g.
Gupta 1963) that the probability in (1) can be written as (2).
Nelson (1982) shows that when (3) holds with $c = -1$, (2) is
still a valid expression for the probability in (1) provided
one defines

$$\Phi(a + ib) = (2\pi)^{-1/2} \int_{-\infty}^{a} e^{-(u + ib)^2/2}\, du. \tag{5}$$

Note that the real part of (5) is an even function of $b$ while
the imaginary part of (5) is an odd function of $b$, so the inner
integral of the imaginary part of the integrand in (2) is zero.

While textbook ANOVA examples and response surface
examples tend to have highly patterned design matrices $X$,
leading to correlations $R$ satisfying (3), the same cannot be
said about real-life experiments: the data may be
observational; the experimenter may not follow a textbook
design; there may be covariates, or missing values. Our
proposal is as follows. Given $R$, find the $R_1$ satisfying (3)
"closest" to $R$. Then solve (1) with $R_1$ in place of $R$, i.e.,
evaluate (2) using the $\lambda_1, \ldots, \lambda_k$ of $R_1$ to obtain a critical
value $|q_1|$ as an approximation to $|q|$.

4. Utilizing Factor Analysis Algorithms

The implementation of our proposal depends crucially on
the recognition that the problem of finding the $R_+$ with $c = +1$
"closest" to $R$ is the Factor Analysis problem of
computing the "population" correlation matrix $R_+$ of a 1-factor
model (3) that best fits the "sample" correlation matrix
$R$. (In Factor Analysis, $\lambda = (\lambda_1, \ldots, \lambda_k)'$ is referred to as
the factor pattern.) Different measures of closeness
correspond to different methods of Factor Analysis. For
example, the Iterated Principal Factor method finds the $R_+$
that minimizes $\|R - R_+\|$, where $\|\cdot\|$ is the Euclidean norm
defined by $\|A\| = (\sum_i \sum_j |a_{ij}|^2)^{1/2}$. The Maximum Likelihood
method finds the $R_+$ that minimizes $\mathrm{trace}(R_+^{-1} R) - \log(|R_+^{-1} R|)$.
The Generalized Least Squares method finds the $R_+$ that
minimizes $\mathrm{trace}((R R_+^{-1} - I)^2)$. So even though the matrix $R$
is deterministic in our setting, we can use existing Factor
Analysis algorithms to compute $R_+$.
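As an illustration of the Euclidean-norm criterion, the following is a minimal sketch of an iterated principal factor fit for one factor, written in plain Python. It is not the IMSL or SAS routine used in this paper; the function names, the starting communalities (largest absolute off-diagonal correlation in each row), and the fixed iteration counts are assumptions made for the example.

```python
import math

def leading_eigenpair(a, iters=200):
    """Largest eigenvalue and eigenvector of a symmetric matrix by power iteration.

    Assumes the leading eigenvector is not orthogonal to the all-ones start."""
    n = len(a)
    v = [1.0 / math.sqrt(n)] * n
    for _ in range(iters):
        w = [sum(a[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    theta = sum(v[i] * a[i][j] * v[j] for i in range(n) for j in range(n))
    return theta, v

def iterated_principal_factor(r, outer=200):
    """One-factor loadings lam such that r is close to diag(1 - lam^2) + lam lam'."""
    n = len(r)
    # Hypothetical starting communalities: largest |off-diagonal| entry per row.
    h2 = [max(abs(r[i][j]) for j in range(n) if j != i) for i in range(n)]
    lam = [math.sqrt(h) for h in h2]
    for _ in range(outer):
        # Replace the unit diagonal by the current communalities.
        reduced = [[h2[i] if i == j else r[i][j] for j in range(n)]
                   for i in range(n)]
        theta, v = leading_eigenpair(reduced)
        lam = [math.sqrt(theta) * x for x in v]
        h2 = [x * x for x in lam]  # updated communalities
    return lam

def one_factor_matrix(lam):
    """Correlation matrix with structure (3) and c = +1."""
    n = len(lam)
    return [[1.0 if i == j else lam[i] * lam[j] for j in range(n)]
            for i in range(n)]
```

When the input matrix has structure (3) exactly with c = +1, as `one_factor_matrix([0.8, 0.7, 0.6, 0.5])` does, the iteration should return the generating loadings up to numerical error.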

Given $R$, it is possible that an $R_-$ with $c = -1$ can come
closer to $R$ than any $R_+$ with $c = +1$. Structure (3) with $c = -1$
has no meaning in the usual Factor Analysis setting.
However, it is well known that $R$ has structure (3) with $c = -1$
if and only if $R^{-1}$ has structure (3) with $c = +1$ (see
Graybill, 1983, p. 180). Therefore, one can find the $R_+$
closest to $R^{-1}$, calculate $R_- = R_+^{-1}$, which satisfies (3) with
$c = -1$, and let $R_1$ be either $R_+$ or $R_-$, whichever comes closer
to $R$.

When $R$ satisfies (3), usually the Factor Analysis
algorithms will recover the correct $\lambda_i$'s, so the critical value
given by our algorithm is theoretically exact (i.e., $|q_1| = |q|$).
It is also true that $|q_1| = |q|$ for any $R$ with $k \le 3$ because
every $R$ with $k \le 3$ satisfies (3). The case $k = 2$ is trivial.
For $k = 3$, the group of sign changes on $\lambda_i$, when multiplied
by $c = \pm 1$, generates all possible sign patterns of $\rho_{ij}$, $i \ne j$.
Thus, by taking logarithms of $|\rho_{ij}|$, the $|\lambda_i|$'s can be solved as
three unknowns in three linear equations, and then the
proper signs can be attached. As $R$ departs from (3),
because of the continuous nature of our strategy, graceful
degradation of the approximation $|q_1|$ can be expected.
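The k = 3 argument can be sketched directly. The function below is a hypothetical helper, not code from the paper: it takes the three off-diagonal correlations (assumed nonzero), chooses c as the sign of their product, solves the three linear equations in log|λ_i|, and attaches signs with the convention λ_1 > 0. It does not check that 1 − cλ_i² stays positive.

```python
import math

def one_factor_k3(r12, r13, r23):
    """Return (c, (lam1, lam2, lam3)) with rho_ij = c * lam_i * lam_j.

    Assumes all three correlations are nonzero."""
    # rho12 * rho13 * rho23 = c^3 * (lam1 * lam2 * lam3)^2, so c is the sign.
    c = 1 if r12 * r13 * r23 > 0 else -1
    # log|rho_ij| = log|lam_i| + log|lam_j|: three equations, three unknowns.
    l12, l13, l23 = (math.log(abs(r)) for r in (r12, r13, r23))
    a1 = 0.5 * (l12 + l13 - l23)
    a2 = 0.5 * (l12 + l23 - l13)
    a3 = 0.5 * (l13 + l23 - l12)
    lam1 = math.exp(a1)                         # sign convention: lam1 > 0
    lam2 = math.copysign(math.exp(a2), c * r12)
    lam3 = math.copysign(math.exp(a3), c * r13)
    return c, (lam1, lam2, lam3)
```

For example, the correlations (-0.30, 0.24, -0.20) correspond to c = +1 with loadings (0.6, -0.5, 0.4), while (-0.30, -0.24, -0.20) correspond to c = -1 with loadings (0.6, 0.5, 0.4).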

Given a data set, the correlation matrix $R$ can be obtained
by applying any suitable software package to the applicable
model. Then $\lambda$ can be computed from commonly accessible
Factor Analysis algorithms. To solve for $|q_1|$, a ql
subroutine has been written which, given $\lambda$ and degrees of
freedom $\nu$, integrates the outside integral by 24-point Gauss-Legendre
quadrature, the inside integral by 24-point Gauss-Hermite
quadrature, and solves for $|q_1|$ by the modified
secant method. The entire process has been automated in the
S® environment for the regression setting. Accessing the $R$
in the structure returned by the regress function, a factor
function has been written which returns $\lambda$ by calling
subroutine FACTR of IMSL. Accessing $\hat\beta$, $\lambda$, MSE, and $\nu$
returned by regress and factor, a glmci function has been
written which calls ql and returns a structure defining the
simultaneous confidence intervals for $\beta$. A graphics
function ci has been written to plot these confidence
intervals.
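The computation behind such a routine can be sketched in a few lines of plain Python. This is not the ql subroutine: it handles only c = +1, replaces the Gauss-Legendre and Gauss-Hermite rules with composite Simpson rules on truncated ranges, and replaces the modified secant method with bisection. All function names, grid sizes, and truncation limits are assumptions of the sketch.

```python
import math

def phi_cdf(x):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def simpson(a, b, n):
    """Nodes and weights of the composite Simpson rule with n (even) panels."""
    h = (b - a) / n
    nodes = [a + i * h for i in range(n + 1)]
    weights = [h / 3.0 * (1 if i in (0, n) else 4 if i % 2 else 2)
               for i in range(n + 1)]
    return nodes, weights

def s_density(s, nu):
    """Density of s/sigma when nu * s^2 / sigma^2 is chi-square with nu d.f."""
    log_c = math.log(2.0) + 0.5 * nu * math.log(0.5 * nu) - math.lgamma(0.5 * nu)
    if s == 0.0:
        return math.exp(log_c) if nu == 1 else 0.0
    return math.exp(log_c + (nu - 1) * math.log(s) - 0.5 * nu * s * s)

def coverage(q, lam, nu, n_z=80, n_s=100):
    """Probability (2) with c = +1; requires |lam_i| < 1 for every i."""
    z_nodes, z_w = simpson(-8.0, 8.0, n_z)
    s_nodes, s_w = simpson(0.0, 1.0 + 10.0 / math.sqrt(2.0 * nu), n_s)
    root = [math.sqrt(1.0 - l * l) for l in lam]
    total = 0.0
    for s, ws in zip(s_nodes, s_w):
        fs = s_density(s, nu)
        if fs == 0.0:
            continue
        inner = 0.0
        for z, wz in zip(z_nodes, z_w):
            prod = 1.0
            for l, r in zip(lam, root):
                prod *= phi_cdf((l * z + q * s) / r) - phi_cdf((l * z - q * s) / r)
            inner += wz * prod * math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
        total += ws * fs * inner
    return total

def critical_value(alpha, lam, nu, lo=0.0, hi=20.0, iters=40):
    """Solve coverage(q) = 1 - alpha for q by bisection."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if coverage(mid, lam, nu) < 1.0 - alpha:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

A simple sanity check is the case k = 1 with λ = 0, where (2) reduces to the two-sided Student t probability, so `critical_value(0.05, [0.0], 10)` should be close to the tabulated t value 2.228.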

Examples of how well the proposed algorithm performs
in a variety of settings are provided below.

5.  Regression 

Orthogonal designs clearly satisfy (3) with $\lambda = 0$ and can
be thought of as 0-factor designs in our setting. However, if
observations from more than one design point are missing,
or if an orthogonal design is augmented by more than one
additional design point, then generally (3) is not satisfied
exactly. But the approximation $|q_1|$ is often very close to $|q|$,
as the following example with observational data
demonstrates.

5.1  Motor  vehicle  death  example 

Data on page 191 of Draper and Smith (1981) give, for
the 49 contiguous states, number of motor vehicle deaths (Y)
in 1964, number of drivers $\times 10^{-4}$ in 1964 ($X_1$), number of
persons per square mile ($X_2$) in 1963, rural road mileage
$\times 10^{-3}$ ($X_3$) in 1963, and normal January maximum
temperature ($X_4$). The linear model $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2
+ \beta_3 X_3 + \beta_4 X_4$ produces an $R^2$ of 0.9684 with the
following correlation matrix for $\hat\beta_1, \hat\beta_2, \hat\beta_3, \hat\beta_4$:

/I  -  \ 

0,.S63800.8  1 

0.6676.838  0.606766:  1 

,,  ■0.2960668  0. 20,81862  0. 2066284  1  j 

Using an Iterated Principal Factor algorithm, $\lambda$ is found to
be (0.8161860, 0.7128773, 0.8217866, 0.3083787)',
leaving a residual correlation matrix of

$$\begin{pmatrix} 0 & & & \\ 0.01779729 & 0 & & \\ 0.003054976 & 0.02120399 & 0 & \\ -0.04982275 & -0.00941977 & -0.04132158 & 0 \end{pmatrix}$$

Based on this 1-factor approximation, critical values $|q_1|$
for various $\alpha$ are then computed. To check the accuracy of
these approximate critical values, 40000 pairs of
$\max_i |\hat\beta_i - \beta_i| / s\sqrt{v_{ii}}$ were generated based on $R$ and $R_1$ in such a
way that a control-variate variance reduction technique could
be applied to reduce the standard deviation of the estimate of
true $\alpha$ roughly by a factor of 3. We found

Nominal α    |q_1|       Unb. Est. of True α    95% CI of True α
0.10         2.229697    0.10 + 0.00006         (0.0993, 0.1009)
0.05         2.536990    0.05 + 0.00000         (0.0494, 0.0506)
0.01         3.168086    0.01 + 0.00002         (0.0097, 0.0103)

It would seem that the 1-factor approximation is adequate for
practical purposes. The following table compares the critical
values associated with the various methods.


            Scheffe    Bonferroni    Sidak    Factor Analysis
α = 0.10    2.883      2.321         2.305    2.230
α = 0.05    3.215      2.605         2.597    2.537
α = 0.01    3.888      3.207         3.206    3.168
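The Bonferroni and Sidak columns depend only on Student t quantiles, so they can be checked independently. The sketch below assumes k = 4 coefficients of interest and ν = 44 degrees of freedom (49 observations minus 5 regression coefficients); those two numbers are inferred from the example, not stated in the table. The t distribution function is obtained by Simpson's rule and inverted by bisection, a choice made here only to keep the example free of external libraries.

```python
import math

def t_cdf(x, nu, panels=2000):
    """P(T <= x) for x >= 0, by Simpson's rule on the Student t density."""
    log_c = (math.lgamma(0.5 * (nu + 1)) - math.lgamma(0.5 * nu)
             - 0.5 * math.log(nu * math.pi))

    def pdf(t):
        return math.exp(log_c - 0.5 * (nu + 1) * math.log1p(t * t / nu))

    h = x / panels
    total = pdf(0.0) + pdf(x)
    for i in range(1, panels):
        total += (4 if i % 2 else 2) * pdf(i * h)
    return 0.5 + h / 3.0 * total

def t_quantile(p, nu):
    """Upper quantile (p > 0.5) of Student t by bisection."""
    lo, hi = 0.0, 100.0
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if t_cdf(mid, nu) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def bonferroni(alpha, k, nu):
    """Two-sided critical value from the Bonferroni inequality."""
    return t_quantile(1.0 - alpha / (2.0 * k), nu)

def sidak(alpha, k, nu):
    """Two-sided critical value from Sidak's inequality."""
    per_comparison = 1.0 - (1.0 - alpha) ** (1.0 / k)
    return t_quantile(1.0 - per_comparison / 2.0, nu)
```

Under these assumptions, `bonferroni(0.05, 4, 44)` and `sidak(0.05, 4, 44)` should agree with the α = 0.05 row of the table to about three decimals.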

6. One-way Designs with a Covariate

Assume the model

$$Y_{ia} = \beta_i + \beta (X_{ia} - \bar X_{..}) + \epsilon_{ia}, \quad i = 1, \ldots, k, \ a = 1, \ldots, n_i,$$

where $\beta_1, \ldots, \beta_k$ are the treatment effects, $\beta$ is the common
slope for the covariate $X$, $\bar X_{..} = \sum_i \sum_a X_{ia} / \sum_i n_i$, and $\epsilon_{ia}$ are
iid $N(0, \sigma^2)$ with $\sigma^2$ unknown.

It can be verified that if the treatment effects $\beta_1, \ldots, \beta_k$
themselves are of interest, then (3) holds so exact
simultaneous confidence intervals can be computed by our
method. If, on the other hand, treatments versus control
effects $\beta_1 - \beta_k, \ldots, \beta_{k-1} - \beta_k$ are of interest, then the critical
value $|q|$ has to be approximated. The following example
demonstrates both possibilities.

6.1 Starch example

Scheffe (1959, p. 216) gives breaking strength (y) in
grams and thickness (x) in $10^{-4}$ inch from tests on 7 types of
starch film. (Starch 1 = Canna, 2 = Sweet Potato, 3 = Corn,
4 = Rice, 5 = Dasheen, 6 = Wheat, 7 = Potato.) It is
assumed that the regression coefficient of y on x is the same
for all starches.

First suppose the $\beta_i$'s themselves are of interest. Using
Factor Analysis algorithms, the $\lambda_i$'s in (3) for the correlation
matrix $R$ of $(\hat\beta_1, \ldots, \hat\beta_7)'$ are found to be

$$\begin{pmatrix} 0.5782180 \\ 0.2092954 \\ -0.4571517 \\ -0.0483264 \\ -0.3582556 \\ -0.7636223 \end{pmatrix}$$

Exact critical values based on these $\lambda_i$'s and 86 d.f. for MSE
for this data are then computed by our algorithm, which
compare with critical values from other methods as follows.

            Scheffe    Bonferroni    Sidak    Exact
α = 0.10    3.538      2.501         2.484    2.449
α = 0.05    3.851      2.756         2.749    2.721
α = 0.01    4.470      3.296         3.294    3.283

Next suppose treatments versus control are of interest,
with Potato as the control. The BLUE for the differences of
breaking strength are as follows. (For later reference, the
estimates employed by the MEANS option of PROC GLM
in SAS, which do not take the covariate into account, are
also displayed.)

Breaking Strength      With Covariate    Without Covariate
Starch 7 - Starch 1        58.94             181.14
Starch 7 - Starch 2        71.36             265.43
Starch 7 - Starch 3       119.01             493.60
Starch 7 - Starch 4       146.08             437.25
Starch 7 - Starch 5       174.90             563.74
Starch 7 - Starch 6       180.59             667.68

The correlation matrix $R$ of the BLUE (with covariate) for
the differences of breaking strength is

$$\begin{pmatrix} 1 & & & & & \\ 0.39585 & 1 & & & & \\ 0.56777 & 0.49365 & 1 & & & \\ 0.54682 & 0.46214 & 0.75986 & 1 & & \\ 0.51409 & 0.44881 & 0.76754 & 0.69301 & 1 & \\ 0.55055 & 0.49229 & 0.86517 & 0.77387 & 0.79156 & 1 \end{pmatrix}$$

Using a maximum likelihood factor analysis algorithm, the
closest 1-factor model is found to be

$$\begin{pmatrix} 0.79025\, Z_1 \\ 0.84280\, Z_2 \\ 0.38954\, Z_3 \\ 0.55876\, Z_4 \\ 0.54342\, Z_5 \\ 0.35255\, Z_6 \end{pmatrix} + \begin{pmatrix} 0.61285 \\ 0.53822 \\ 0.92101 \\ 0.82933 \\ 0.83946 \\ 0.93580 \end{pmatrix} Z_0$$

which leaves a residual correlation matrix of

$$\begin{pmatrix} 0 & & & & & \\ 0.06604 & 0 & & & & \\ 0.00339 & -0.00206 & 0 & & & \\ 0.03862 & 0.01577 & -0.00397 & 0 & & \\ -0.00032 & -0.00301 & -0.00562 & -0.00310 & 0 & \\ -0.02289 & -0.01138 & 0.00329 & -0.00222 & 0.00000 & 0 \end{pmatrix}$$

with an average root mean square off-diagonal residual of
0.02143014.

Based on this 1-factor approximation, critical values $|q_1|$
for various $\alpha$ are then computed. To check the accuracy of
these approximate critical values, again 40000 pairs of
$\max_i |\hat\beta_i - \beta_i| / s\sqrt{v_{ii}}$ were generated based on $R$ and $R_1$
and a control-variate technique was applied to reduce the
variance of the estimate of true $\alpha$. We found


Nonifnalij  Iqp  Lnb.Jd 'LJ'CXrne  ij  95'  i  ('I  ot-Tnic  u. 
O.IO  2.2()42  '5  0.10  0,000025  lO.Oddl ,  0. 1  DOS l 

0.0.5  2,S60(.3.)  0.05  -a  0  000175  (0  l)-)9(,,  0  0508) 

0.01  3.154000  O.Oi  0.000125  (O.OOdo.  00102) 


It would seem that the 1-factor approximation is adequate for
practical purposes.

The following table compares the confidence intervals
obtained by various methods.

Breaking Strength      Factor Analysis CI    Sidak's CI         Scheffe's CI       SAS (Tukey) CI
Starch 7 - Starch 1     58.94 ± 127.60        58.94 ± 134.22     58.94 ± 181.29    181.14 ± 139.05
Starch 7 - Starch 2     71.36 ± 193.98        71.36 ± 204.04     71.36 ± 275.61    265.43 ± 209.73
Starch 7 - Starch 3    119.01 ± 183.67       119.01 ± 193.20    119.01 ± 260.96    493.60 ± 126.00
Starch 7 - Starch 4    146.08 ± 167.47       146.08 ± 176.16    146.08 ± 237.94    437.25 ± 142.29
Starch 7 - Starch 5    174.90 ± 207.07       174.90 ± 217.81    174.90 ± 294.21    563.74 ± 161.81
Starch 7 - Starch 6    180.59 ± 220.54       180.59 ± 231.98    180.59 ± 313.35    667.68 ± 123.13


Note  that  whereas  all  the  confidence  intervals  based  on 
BLUEs  cover  0,  none  of  the  SAS  confidence  intervals 
covers  0.  SAS  computes  confidence  intervals  with  the 
covariate  "thickness"  ignored.  As  Figure  1  shows,  the 
significant  differences  in  "starch"  detected  by  SAS  are  due 
mainly  to  the  differences  in  "thickness"  associated  with 
"starch." 

6.2  A  simulation  study 

A small simulation was performed to see if the close
approximation to $R$ by $R_1$ above for the real data example
was a fluke. Taking $k = 10$, $n_1 = \cdots = n_{10} = 10$, and $X_1, \ldots,
X_{10}$ to be iid standard normal, 100 correlation matrices $R$
of $(\hat\beta_1 - \hat\beta_k, \ldots, \hat\beta_{k-1} - \hat\beta_k)'$ were generated using PROC
MATRIX in SAS. First, each $R$ was checked to see if it
satisfied (3). None did. Then for each $R$, the Maximum
Likelihood method in PROC FACTOR of SAS was used to
find the closest $R_1$, and the root-mean-square of the off-diagonal
elements of $R - R_1$ was recorded. The mean of the
100 root-mean-squares was 0.004359595, and the standard
deviation was 0.002385. Their stem-and-leaf plot is given
below.

N = 100   Median = 0.00393708
Quartiles = 0.002674504, 0.00537431

Decimal point is 3 places to the left of the colon

1 : 124456679
2 : 00012333334446667779
3 : 1122234444566777788999
4 : 000223445566677899
5 : 000113556678899
6 : 014789
7 : 26
8 : 0001
9 : 58

High: 0.01429412  0.01523449

7. Two-way Design with Missing Observations

Consider the two-way no-interaction model

$$Y_{ihr} = \mu + \tau_i + \gamma_h + \epsilon_{ihr}, \quad i = 1, \ldots, a, \ h = 1, \ldots, b, \ r = 1, \ldots, n_{ih},$$

where $Y_{ihr}$ are the observations, $\tau_1, \ldots, \tau_a$ are the treatment
effects, $\gamma_1, \ldots, \gamma_b$ are the block effects, $\epsilon_{ihr}$ are iid
Normal$(0, \sigma^2)$, and $\tau_1 - \tau_a, \ldots, \tau_{a-1} - \tau_a$ are of interest. If the
cell sizes $n_{ih}$ are proportionate, that is, $n_{ih} = w_i m_h$ for all $i$
and $h$, then (3) is satisfied, but not generally otherwise.
However, recall (3) is always satisfied when $k \le 3$, and the
following example illustrates this.

7.1  Blood  example 

We use a data set popular for illustrating two-way
unbalanced ANOVA (Fleiss, 1986, p. 166; SAS User's
Guide: Statistics, 1985, p. 492). Of interest was the increase
in systolic blood pressure in dogs after treatment, with
disease as a blocking factor. The sample means and sample
sizes (in parentheses) are given in the following table.


                             Treatment
Disease        1             2             3             4
   1       29.333 (6)    28.000 (8)    16.333 (3)    13.600 (5)
   2       28.250 (4)    33.500 (4)     4.400 (5)    12.833 (6)
   3       20.400 (5)    18.167 (6)     8.500 (4)    14.200 (5)

Using a Factor Analysis algorithm, $\lambda$ is found to be
(0.69567, 0.69901, 0.64584)', from which exact critical
values were computed. The following table compares the
critical values given by the various methods.


            Scheffe    Bonferroni    Sidak    Exact
α = 0.10    2.565      2.187         2.172    2.119
α = 0.05    2.889      2.474         2.468    2.427
α = 0.01    3.542      3.077         3.076    3.052

7.2  A  simulation  study 


A small simulation was performed to see how well
$R_1$ can approximate $R$ for larger $k$. In particular, we took $a = b
= 10$, $n_{1,1} = \cdots = n_{10,10} = 1$, and considered the missing
completely at random case (cf. Little and Rubin, 1987,
p. 14), where independently each observation has a 0.5
probability of being missing. Using PROC MATRIX in SAS, 100
correlation matrices $R$ of $(\hat\tau_1 - \hat\tau_a, \ldots, \hat\tau_{a-1} - \hat\tau_a)'$ were
generated. First, each $R$ was checked to see if it satisfied
(3). None did. Then for each $R$, the Maximum Likelihood
method in PROC FACTOR of SAS was used to find the
closest $R_1$, and the root-mean-square of the off-diagonal
elements of $R - R_1$ was recorded. The mean of the 100
root-mean-squares was 0.001343971, and the standard
deviation was 0.00030142. Their stem-and-leaf plot is
given below.

N = 100   Median = 0.00134159
Quartiles = 0.00112995, 0.001558005

Decimal point is 4 places to the left of the colon

 6 : 25
 7 : 9
 8 : 05
 9 : 13889
10 : 0123445678
11 : 01133334667899
12 : 0012333457889
13 : 12355567789
14 : 11122356789999
15 : 3357889
16 : 0012366
17 : 05567
18 : 223577
19 : 1
20 : 4
21 : 0

8. Vector Processing

It was relatively straightforward to code the proposed
algorithm so that it will take advantage of vector processing
capability when available. On a Cray X-MP with the cft77
compiler, for example, execution is at least five times faster
with the vector processing capability turned on as compared
to optionally using the scalar processing capability only.

ASSESSMENT OF PREDICTION PROCEDURES
IN MULTIPLE REGRESSION ANALYSIS

Victor Kipnis, University of Southern California


1. Introduction. As opposed to traditional inference
based on an a priori specified model, the main feature
of modern regression analysis is model building with
regard to some specified regression goals. This process
usually involves some scrutinizing of both the available
data and a set of potential equations before settling on
the final version of the model. Examples of this
strategy, such as examination of residuals, checking the standard
assumptions of homoscedasticity and absence of serial
correlation, analysis of outliers, regression diagnostics,
sorting out or transforming data, choosing the form of
the equation, selecting explanatory variables, etc.,
constitute what is usually called the exploratory, or
data-analytical, approach to modelling. Exploratory methods
are often used iteratively and in alternation with fitting
of tentative models and thus make contemporary
regression analysis a complex, multistep, iterative
process. In practice, when all this activity is carried out
using and reusing the same data, conventional means
of inference both at the intermediate steps and for the
finally chosen model could be misleading, sometimes
badly so (e.g. Freedman, 1983; Lovell, 1983; Miller,
1984; Pinsker et al., 1985, 1987).

This paper concentrates on the evaluation of exploratory
procedures for prediction, which is one of the
major goals in applied regression analysis. In this case
model building is usually reduced to selection of the
'most efficient' predictor among the class of potential
regression equations, that is, one that provides the minimum
mean squared error of prediction (MSEP). There
is a good deal of literature on different selection procedures
with regard to their computational (logical)
scheme (e.g. see a review in Hocking, 1976), but much
less has been done concerning analysis of the statistical
properties of the predictors thereby selected. As has been
repeatedly pointed out (e.g. Berk, 1978; Hjorth, 1982;
Miller, 1984), the theory behind the conventional MSEP
estimators is not valid when predictor selection and
estimation are from the same data. The very selection
process introduces considerable distortion into the
distribution of these estimators and, in particular, leads to
their substantial bias when the selection effect is not
allowed for. The present paper attacks this problem along
the same lines as in (Kipnis, Pinsker, 1983; Pinsker et
al., 1985; Kipnis, 1987).

We bring in the 'procedural approach' and suggest
that assessment of the efficiency of any predictor should
rest on the assessment of the procedure by which this
predictor has been chosen, rather than the evaluation of
any particular prediction equation. As exact distributional
results are virtually impossible to obtain, even for
relatively simple selection procedures, it is suggested to
estimate procedural performance in a simulation study
by generating bootstrap pseudosamples and applying to
them the same regression procedure that was used for
the original data.
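The idea of rerunning the whole procedure on pseudosamples can be sketched as follows. Everything specific here is an assumption made for illustration and is not taken from the paper: the procedure picks the single predictor (no intercept) with the smallest residual sum of squares, pseudosamples are built by resampling residuals around the fit selected on the original data, and the target set shares the construction set's predictor values.

```python
import random

def select_and_fit(xs, y):
    """Hypothetical procedure g: choose the one predictor with the smallest RSS.

    xs is a list of predictor columns; returns (index, fitted values)."""
    best = None
    for j, x in enumerate(xs):
        b = sum(a * c for a, c in zip(x, y)) / sum(a * a for a in x)
        rss = sum((c - b * a) ** 2 for a, c in zip(x, y))
        if best is None or rss < best[0]:
            best = (rss, j, b)
    _, j, b = best
    return j, [b * a for a in xs[j]]

def bootstrap_msep(xs, y, n_boot=200, seed=0):
    """Return (selected index, apparent loss, bootstrap estimate of MSEP)."""
    rng = random.Random(seed)
    n = len(y)
    j, fitted = select_and_fit(xs, y)
    resid = [c - f for c, f in zip(y, fitted)]
    apparent = sum(r * r for r in resid) / n
    total = 0.0
    for _ in range(n_boot):
        # Pseudo construction set and an independent pseudo target set.
        y_star = [f + rng.choice(resid) for f in fitted]
        y_new = [f + rng.choice(resid) for f in fitted]
        # Rerun the entire procedure, selection included, on the pseudosample.
        _, pred = select_and_fit(xs, y_star)
        total += sum((p - t) ** 2 for p, t in zip(pred, y_new)) / n
    return j, apparent, total / n_boot
```

Because selection is repeated inside the loop, the bootstrap figure reflects the variability of the procedure rather than of one selected equation, which is the distinction this section draws.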

For illustrative purposes, prediction procedures based
on subset selection methods are considered. It should
be emphasized, though, that the general concepts of
the present study apply to any other data-analytical
procedures for model building.

2. Problem Formulation. Procedural Approach.
Consider the linear regression model

$$Y = Y^{\circ} + \epsilon = X\beta + \epsilon, \tag{1}$$

where $Y = (Y^1, \ldots, Y^n)'$ is an $n$-vector of observations
on the response variable, $X = [X_1, \ldots, X_k]$ is an $n \times k$ full
rank matrix of observations on each of the $k$ predictor
variables, $\beta = (\beta^1, \ldots, \beta^k)'$ is a $k$-vector of unknown
coefficients, $\epsilon = (\epsilon^1, \ldots, \epsilon^n)'$ is an $n$-vector of unobservable
disturbances, $\epsilon \sim N(0, \sigma^2 I_n)$. The $X$'s are assumed
fixed.

The given $n > k$ observations on the response and the
predictor variables constitute, we suppose, a construction
set $V = (X_V, Y_V)$. Let $W = (X_W, Y_W)$ be a new
set of $n_W$ observations based on the same model (1):

$$Y_W = Y_W^{\circ} + \epsilon_W = X_W \beta + \epsilon_W,$$

where $\epsilon_W$ is independent from $\epsilon_V$. We will call $W$ the
'target' set. Given the construction set $V$ and the matrix
$X_W$, the regression goal is to predict vector $Y_W$
with some predictor $\hat Y_W$ based on a $p$-subset, $p \le k$, of
the predictor variables. We assume that this predictor
is selected among the class of potential predictors (say,
all possible subsets) by applying to the data $V$ some
prediction procedure $g$.

Without providing any strict formalization, by 'procedure'
we will understand a mapping from the input
data to the output. In the present case the input consists
of the construction set $V$, matrix $X_W$, the class
of potential predictors, and, perhaps, some criteria for
choosing a particular subset and estimating the corresponding
parameters. The output includes the chosen
subset of predictor variables, estimates of its parameters,
and vector $\hat Y_W$. Roughly speaking, all operations
performed on the construction set $V$ to get the output,
the entire exploratory process, is covered by the use of
the term procedure.

It should be stressed that procedure $g$ is conceived
of as a separate whole, a distinct statistical entity, in
spite of the fact that various nonstatistical considerations
(experience, professional intuition, simplicity of
computation, etc.) may and in fact do influence its
choice. The relationship between a procedure for model
building and the model it builds resembles that between an
estimator and an estimate. Just as an estimate is simply a
number, a selected model is characterized by a realized
vector of estimated parameters. Its justification should
rest on an assessment of the procedure which, by analogy
with an estimator, is the 'recipe', 'selection strategy',
or 'algorithm' by which the data are transformed into
an actual model. As a result of such conceptualization,
a procedure becomes an object of statistical study, allows
statistically valid evaluation and comparison of different
procedures, and, thus, belongs to the sphere of formal
inference.

Turning back to subset selection, a typical prediction
procedure selects a $p$-subset $X_p = [X_{j_1}, \ldots, X_{j_p}]$ and
yields the $n_W$-vector $\hat Y_W = X_W \hat\beta$ based on the OLS
fitting of the selected subset. If the full set of regressors
is not selected, some of the components of the realized
vector $\hat\beta$ are zero.

It is important to emphasize that the chosen predictor
$\hat Y_W$ does not pretend to represent the 'real' model by
including all the significant variables and excluding all the
nonsignificant ones (with $\beta_j = 0$). Moreover, procedure
$g$ selects a random subset of predictor variables which
may vary for different realizations of $V$. Because of that
fact, evaluation of a predictor $\hat Y_W = g(V; X_W)$ should
be based on the assessment of the selection procedure $g$
rather than any particular model (selected subset) chosen
for the given realization of the construction set.
Consider the prediction error

$$e_W = \hat Y_W - Y_W = g(V; X_W) - Y_W. \tag{2}$$

Since $X_V$ and $X_W$ are assumed fixed and known, below
we will use the notation $\hat Y_W = g_W(Y_V)$. Let us split $e_W$
into a nonrandom and a random part:

$$e_W = e_W^{\circ} + \delta e_W,$$

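where

$$e_W^{\circ} = \hat Y_W^{\circ} - Y_W^{\circ} = g_W(Y_V^{\circ}) - Y_W^{\circ} \tag{3}$$

is an error of predicting the nonrandom part $Y_W^{\circ}$ of $Y_W$
by applying procedure $g$ to the nonrandom part $V^{\circ} =
(X_V, Y_V^{\circ})$ of the construction set $V$. From (2)-(3) it
follows that

$$\delta e_W = e_W - e_W^{\circ} = \delta g_W - \epsilon_W, \tag{4}$$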




where $\delta g_W = g_W(Y_V) - g_W(Y_V^{\circ})$. For the quadratic loss
function

$$L_W(g; V) = \frac{1}{n_W}\, e_W' e_W \tag{5}$$

two distinct risk functions are important measures of
efficiency of the procedure $g$. The first one is the conditional
MSEP for a fixed vector $Y_V$ (or the fixed construction
set $V$, since $X_V$ is fixed anyway):

$$R(g; Y_V) = \mathrm{MSEP}(g_W(Y_V) \mid V) = E[L_W(g; V) \mid V].$$

It follows from (2)-(5) that

$$R(g; Y_V) = \sigma^2 + \frac{1}{n_W}\,(g_W(Y_V) - Y_W^{\circ})'(g_W(Y_V) - Y_W^{\circ}). \tag{6}$$

Note that $R(g; Y_V)$ is a random variable together with
$Y_V$ and could be considered as measuring both the predictive
ability of a selected model and the efficiency of
the selection procedure $g$ for a given construction set $V$.

To be able to analyze existing procedures and to invent
new ones, one needs to know their statistical properties
for different realizations of $V$. One such characteristic
is the unconditional MSEP, i.e., the average risk over
all possible construction sets:

$$R(g) = E_{Y_V}[R(g; Y_V)] = \sigma^2 + \frac{1}{n_W}\, (E e_W)'(E e_W) + \frac{1}{n_W}\, \mathrm{tr}(\mathrm{VAR}[\delta g_W]), \tag{7}$$

where $\mathrm{VAR}[\delta g_W] = E[(\delta g_W - E\delta g_W)(\delta g_W - E\delta g_W)']$
is the variance-covariance matrix of $\delta g_W$. As opposed
to the conditional risk (6), the unconditional risk (7)
measures the average efficiency of the selection procedure
$g$, not of a selected model, which depends on the
realization of $V$.

There is no full agreement in the literature on whether
conditional or unconditional MSEP is more appropriate
for prediction assessment. The traditional statistics
were originally proposed as estimators of the unconditional
MSEP (e.g., Hocking, 1976), but they seem
now to be looked upon as estimators of the conditional
risk (6) (see Hjorth, 1982; Efron, 1983; Picard and
Cook, 1984). Below we will consider statistical properties
of different estimators with regard to both measures
(6) and (7).
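A small Monte Carlo experiment can make the unconditional risk (7) concrete in the simplest setting. The sketch below assumes a procedure that always fits one a priori chosen predictor by OLS without intercept, a correctly specified model, and a target set with the same predictor values as the construction set; in that case the bias term of (7) vanishes and the trace term equals $\sigma^2 p / n$ with $p = 1$, so (7) reduces to $\sigma^2(1 + 1/n)$. The function name and the particular numbers are hypothetical.

```python
import random

def simulate_unconditional_msep(x, beta, sigma, n_rep=4000, seed=0):
    """Average quadratic loss (5) over many construction/target set pairs.

    The procedure is OLS through the origin on the single predictor x,
    and the target set reuses the same x values (X_W = X_V)."""
    rng = random.Random(seed)
    n = len(x)
    sxx = sum(a * a for a in x)
    total = 0.0
    for _ in range(n_rep):
        y_v = [beta * a + rng.gauss(0.0, sigma) for a in x]  # construction set
        y_w = [beta * a + rng.gauss(0.0, sigma) for a in x]  # target set
        b_hat = sum(a * c for a, c in zip(x, y_v)) / sxx
        total += sum((b_hat * a - t) ** 2 for a, t in zip(x, y_w)) / n
    return total / n_rep
```

With `x = [1, 2, ..., 10]`, `beta = 2` and `sigma = 1`, the simulated average should be close to 1.1, the value given by (7) under these assumptions.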




The most obvious estimator of MSEP is the apparent loss, or autoloss,

$$AL(g; V) = \frac{1}{n}(\hat Y_V - Y_V)'(\hat Y_V - Y_V) = \frac{1}{n}(g_V(Y_V) - Y_V)'(g_V(Y_V) - Y_V), \qquad (8)$$

which measures the goodness-of-fit of the procedure $g$ on the
construction set $V$. For most procedures AL tends to underestimate
MSEP, because the same data have been used for both construction and
evaluation. This familiar fact is easily demonstrated when the
procedure $g$ consists of OLS fitting of an a priori specified subset
$X_{Vp}$, and $X_W = X_V$, i.e., when we predict new observations
$Y_W = X_V\beta + \varepsilon_W$ for the same set $X_V$ of explanatory
variables. For future reference we will denote this procedure by $g_p$.
We have

$$g_p(Y_V) = P_p Y_V, \qquad (9)$$

where $P_p = X_{Vp}(X_{Vp}'X_{Vp})^{-1}X_{Vp}'$ is the projection matrix
onto the linear space spanned by the column vectors of the matrix
$X_{Vp}$. In this case $Y_W^{\circ} = X_V\beta = Y_V^{\circ}$,
$e_W^{\circ} = e_V^{\circ}$, $E(g_p(Y_V)) = P_p Y_V^{\circ} = g_p(Y_V^{\circ})$,
and it follows from (7)-(8) that

$$R(g_p) = \frac{1}{n}\, e_V^{\circ\prime} e_V^{\circ} + \frac{\sigma^2}{n}(n + p) \qquad (10)$$

and

$$AR(g_p) = E[AL(g_p; V)] = \frac{1}{n}\, e_V^{\circ\prime} e_V^{\circ} + \frac{\sigma^2}{n}(n - p), \qquad (11)$$

so that AL underestimates MSEP with the negative bias $-2p\sigma^2/n$.
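The bias implied by (10)-(11) can be checked by simulation. The following is a minimal sketch, assuming NumPy, a randomly generated design, and $X_W = X_V$; the function name `simulate_fixed_subset`, the design matrix, and the coefficient values are illustrative assumptions and are not taken from the original study.

```python
import numpy as np

def simulate_fixed_subset(X, beta, subset, sigma2, reps=20000, seed=0):
    """Monte Carlo averages of AL and of the target loss for OLS on a fixed subset.

    Assumes X_W = X_V. Returns (mean AL, mean target loss,
    AR from (11), R from (10)).
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    p = len(subset)
    Xp = X[:, subset]
    P = Xp @ np.linalg.solve(Xp.T @ Xp, Xp.T)      # projection matrix P_p
    y0 = X @ beta                                   # noise-free response Y_V^o = Y_W^o
    e0 = P @ y0 - y0                                # bias vector e_V^o
    sd = np.sqrt(sigma2)
    YV = y0 + sd * rng.standard_normal((reps, n))   # construction responses, one per row
    YW = y0 + sd * rng.standard_normal((reps, n))   # independent target responses
    fit = YV @ P                                    # row r is (P_p Y_V)' since P_p is symmetric
    al_mean = np.mean(np.mean((fit - YV) ** 2, axis=1))
    lw_mean = np.mean(np.mean((fit - YW) ** 2, axis=1))
    ar_theory = e0 @ e0 / n + sigma2 * (n - p) / n
    r_theory = e0 @ e0 / n + sigma2 * (n + p) / n
    return al_mean, lw_mean, ar_theory, r_theory

# Illustrative (hypothetical) setup: the subset contains both active variables.
rng = np.random.default_rng(1)
n, k, sigma2 = 50, 6, 1.0
X = rng.standard_normal((n, k))
beta = np.array([7.0, 5.0, 0.0, 0.0, 0.0, 0.0])
subset = [0, 1, 2]
al_mean, lw_mean, ar_theory, r_theory = simulate_fixed_subset(X, beta, subset, sigma2)
print("simulated AL - loss:", al_mean - lw_mean)
print("theoretical bias   :", -2 * len(subset) * sigma2 / n)
```

With a fixed subset the simulated gap between the autoloss and the target loss should be close to $-2p\sigma^2/n$; once the subset is chosen from the data, this simple correction no longer applies.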

Such ‘adjusted’ forms of AR as $J_p$ (Rothman, 1968), $C_p$ (Mallows,
1973), AIC (Akaike, 1973), and other conventional estimators have been
suggested to allow for this bias, or ‘overoptimism’, of the
self-evaluating autoloss function. These statistics are asymptotically
unbiased under certain conditions, the major assumption being that $g$
is based on fitting a subset that has been chosen independently of the
construction data $V$. Thus, conventional estimators only partly adjust
AL with regard to self-evaluation, while the actual subset selection is
never allowed for. As a result, these estimators still carry some
overoptimism and, in their turn, need adjustment. To get more adequate
estimators of the prediction risk, one has to study the distribution of
$\hat Y_W$ under the very procedure $g$ that has yielded this predictor.

3. Pseudosample Method. The idea is to analyze the statistical
properties of the predictor $\hat Y_W$ by applying procedure $g$ to
data generated by a known random mechanism. The main requirement is
that this mechanism, or, as we will call it, pseudomodel, should
simulate the unknown model (1). In other words, it should generate
pseudosamples that are ‘close’ to the real ones with regard to their
statistical structure.

Consider the maximum likelihood estimator of the model (1)

$$\tilde Y = X\hat\beta + \tilde\varepsilon, \qquad \tilde\varepsilon \sim N(0, \hat\sigma^2 I_n), \qquad (12)$$

where $\hat\beta = (X_V'X_V)^{-1}X_V'Y_V$ and
$\hat\sigma^2 = Y_V'(I_n - P_k)Y_V/(n-k)$ are the OLS estimates of the
parameters based on the set of all the explanatory variables. The
decision to use the ‘full’ estimated model (12) and not, say, the
subset selected by the procedure $g$ is made because our goal here is
simulation and not prediction. By using unbiased ML estimators of the
parameters $\beta$ and $\sigma^2$, we hope to get a pseudomodel that is
as close to the real one as possible. Note that this choice is not
mandatory for the suggested approach, and, in principle, other
pseudomodels could be used in different situations. In the present
case the pseudomodel (12) is the parametric bootstrap model as
described in (Efron, 1982).

Consider now a pseudoconstruction sample $\tilde V = (X_V, \tilde Y_V)$
and a pseudotarget sample $\tilde W = (X_W, \tilde Y_W)$, where
$\tilde Y_V$ and $\tilde Y_W$ are, respectively, $n \times 1$ and
$n_W \times 1$ random vectors independently generated from model (12).
Applying the same selection procedure $g$ that is used for the original
construction set $V$ to the pseudosample $\tilde V$, we get a
pseudopredictor $\hat{\tilde Y}_W = g_W(\tilde Y_V)$ and a vector
$\tilde e_W = \hat{\tilde Y}_W - \tilde Y_W$ of pseudoerrors. As
pseudomodel (12) is completely known, we can, at least in principle,
analyze the distribution of $\tilde e_W$ and use its characteristics as
estimates of their counterparts for the distribution of the real
vector $e_W$.
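The generation of a pseudoconstruction and a pseudotarget sample from (12) can be sketched as follows. This is only an illustration, assuming NumPy; the helper names `fit_pseudomodel` and `draw_pseudosample` and the small data set are hypothetical, and the pseudotarget mean is taken to be $X_W\hat\beta$.

```python
import numpy as np

def fit_pseudomodel(X, y):
    """OLS fit of the full model: returns beta_hat and the unbiased sigma2_hat."""
    n, k = X.shape
    beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ beta_hat
    sigma2_hat = resid @ resid / (n - k)      # residual mean square of the full model
    return beta_hat, sigma2_hat

def draw_pseudosample(X_V, X_W, beta_hat, sigma2_hat, rng):
    """Draw independent pseudoconstruction and pseudotarget responses from (12)."""
    sd = np.sqrt(sigma2_hat)
    y_V = X_V @ beta_hat + sd * rng.standard_normal(X_V.shape[0])
    y_W = X_W @ beta_hat + sd * rng.standard_normal(X_W.shape[0])
    return y_V, y_W

# Hypothetical example data, used only to exercise the two helpers.
rng = np.random.default_rng(0)
X_V = rng.standard_normal((50, 4))
Y_V = X_V @ np.array([7.0, 5.0, 0.0, 0.0]) + rng.standard_normal(50)
beta_hat, sigma2_hat = fit_pseudomodel(X_V, Y_V)
pseudo_Y_V, pseudo_Y_W = draw_pseudosample(X_V, X_V, beta_hat, sigma2_hat, rng)
print(beta_hat.round(2), round(float(sigma2_hat), 3))
```

The design matrices are reused unchanged; only the responses are regenerated, which matches the description of the pseudosamples above.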

One possible approach to deriving pseudosample estimators is as
follows. Instead of directly estimating $MSEP(g)$ with the
corresponding pseudorisk, let us evaluate the overoptimism of the
autoloss $AL(g)$ in order to make an appropriate adjustment. At least
two choices present themselves for representing average overoptimism:
the difference

$$Q^A(g) = R(g) - AR(g)$$

and the ratio

$$Q^M(g) = R(g)/AR(g).$$

We will estimate $Q^A(g)$ and $Q^M(g)$ with their pseudocounterparts

$$\tilde Q^A(g) = \tilde R(g) - \widetilde{AR}(g) = E_{\sim}[\tilde L_W(g, \tilde V) - \widetilde{AL}(g, \tilde V)]$$

and

$$\tilde Q^M(g) = \tilde R(g)/\widetilde{AR}(g) = E_{\sim}[\tilde L_W(g, \tilde V)] \,/\, E_{\sim}[\widetilde{AL}(g, \tilde V)],$$

where $E_{\sim}$ indicates expectation with respect to the random
mechanism (12). In these formulas, the construction sample $V$ is held
fixed. This yields the following two pseudosample estimators of MSEP:
the additive estimator

$$\hat R^A = AL(g) + \tilde Q^A(g)$$

and the multiplicative one

$$\hat R^M = AL(g)\,\tilde Q^M(g).$$

The reason behind these estimators is to get more pivotal statistics
as compared to the direct estimator $\tilde R(g)$.
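Given pseudoreplications of the target loss and of the autoloss, the two adjustments reduce to a few lines of arithmetic. The sketch below assumes NumPy; the function name `pseudosample_estimators` and the numbers passed to it are hypothetical and serve only to show the two formulas side by side.

```python
import numpy as np

def pseudosample_estimators(al_observed, pseudo_target_losses, pseudo_autolosses):
    """Additive and multiplicative pseudosample estimators of MSEP.

    al_observed          : autoloss AL(g) on the real construction set
    pseudo_target_losses : pseudoreplications of the target loss
    pseudo_autolosses    : pseudoreplications of the autoloss
    """
    lw = np.asarray(pseudo_target_losses, dtype=float)
    al = np.asarray(pseudo_autolosses, dtype=float)
    q_add = lw.mean() - al.mean()        # estimated additive overoptimism
    q_mult = lw.mean() / al.mean()       # estimated multiplicative overoptimism
    return al_observed + q_add, al_observed * q_mult

# Hypothetical numbers, chosen only for illustration.
r_add, r_mult = pseudosample_estimators(1.0, [1.2, 1.4], [0.8, 1.0])
print(r_add, r_mult)
```

The two estimators coincide when the observed autoloss equals the average pseudo-autoloss, and differ otherwise.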

4. Linear Procedures. If procedure $g$ is simple enough, pseudosample
estimators can be studied analytically. Consider an important case
when $g$ is linear with respect to $Y_V$:

$$g_W(Y_V) = G_W Y_V,$$

where $G_W$ is an $n_W \times n$ matrix. As $Ee_W = e_W^{\circ}$ and
$VAR[g_W] = \sigma^2 G_W G_W'$, it follows from (7)-(8) that

$$R(g) = \sigma^2 + \frac{1}{n_W}\, e_W^{\circ\prime} e_W^{\circ} + \frac{\sigma^2}{n_W}\operatorname{tr}(G_W G_W') \qquad (13)$$

$$AR(g) = \frac{1}{n}\, e_V^{\circ\prime} e_V^{\circ} + \frac{\sigma^2}{n}\operatorname{tr}[(G_V - I_n)(G_V - I_n)'] \qquad (14)$$

In a similar way, from model (12) we have

$$\tilde R(g) = \hat\sigma^2 + \frac{1}{n_W}\, \tilde e_W^{\circ\prime} \tilde e_W^{\circ} + \frac{\hat\sigma^2}{n_W}\operatorname{tr}(G_W G_W') \qquad (15)$$

$$\widetilde{AR}(g) = \frac{1}{n}\, \tilde e_V^{\circ\prime} \tilde e_V^{\circ} + \frac{\hat\sigma^2}{n}\operatorname{tr}[(G_V - I_n)(G_V - I_n)'], \qquad (16)$$

where $\tilde e_V^{\circ} = (G_V - I_n)X_V\hat\beta$ and
$\tilde e_W^{\circ} = (G_W - I_n)X_V\hat\beta$. Comparing expressions
(13)-(14) and (15)-(16) we get

Theorem 1. If $g$ is a linear procedure, and if $\hat\beta$ and
$\hat\sigma^2$ are unbiased estimators of $\beta$ and $\sigma^2$,
respectively, $\hat R^A$ is an unbiased estimator of the prediction
risk MSEP.

The multiplicative estimator $\hat R^M$ still remains biased, except
for the important case when $e_V^{\circ} = e_W^{\circ} = 0$. The vector
$e^{\circ}$, as follows from the definition (3), represents ‘bias’ in
applying procedure $g$ to the nonrandom data $V^{\circ}$. We will call
it o-bias. Any procedure that does not have o-bias is called
o-adequate. o-adequacy means the property to build up the ‘true’ model
on the ‘faultless’ data. Putting $e_V^{\circ}$, $e_W^{\circ}$,
$\tilde e_V^{\circ}$, $\tilde e_W^{\circ}$ to zero in formulas
(13)-(16), we have

Theorem 2. If $g$ is a linear o-adequate procedure, $\hat R^M$ is an
unbiased estimator of the prediction risk MSEP.

Consider again the example of procedure $g_p$ (9) when $X_W = X_V$.
Here $G_W = G_V = P_p$, so that it follows from (15)-(16) that

$$\hat R^A = \frac{1}{n}(RSS_p + 2p\hat\sigma^2).$$

As $\hat\sigma^2$ is the mean residual sum of squares for the full
model (1), $\hat R^A$ coincides with the traditional adjusted
(unscaled) estimator of MSEP (e.g., Hocking, 1976)

$$R^{C_p} = \frac{1}{n}(RSS_p + 2p\hat\sigma^2) = \frac{\hat\sigma^2}{n}(C_p + n). \qquad (17)$$

If, in addition, the fitted subset represents the true model, and we
use this information to estimate $\beta$ in (12) by
$\hat\beta_p = (X_{Vp}'X_{Vp})^{-1}X_{Vp}'Y_V$, $\hat R^M$ coincides
with another adjusted estimator

$$J_p = \frac{RSS_p}{n}\left(\frac{n + p}{n - p}\right). \qquad (18)$$
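The two traditional estimators (17) and (18) are simple functions of the residual sum of squares. The sketch below is illustrative only: the function name `traditional_estimators` and the input values are hypothetical, and $C_p$ is computed in Mallows' usual form $RSS_p/\hat\sigma^2 - n + 2p$.

```python
def traditional_estimators(rss_p, p, n, sigma2_hat):
    """Return (C_p, the C_p-based estimator (17), J_p from (18))."""
    c_p = rss_p / sigma2_hat - n + 2 * p      # Mallows' C_p
    r_cp = sigma2_hat * (c_p + n) / n         # eq. (17); equals (RSS_p + 2 p sigma2_hat) / n
    j_p = (rss_p / n) * (n + p) / (n - p)     # eq. (18)
    return c_p, r_cp, j_p

# Hypothetical inputs for illustration.
c_p, r_cp, j_p = traditional_estimators(rss_p=40.0, p=3, n=50, sigma2_hat=1.0)
print(c_p, r_cp, j_p)
```

Both estimators inflate $RSS_p/n$, the first additively by $2p\hat\sigma^2/n$ and the second multiplicatively by $(n+p)/(n-p)$, which mirrors the additive and multiplicative pseudosample corrections.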


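5. Nonlinear Procedures. Empirical Results. When procedure $g$
involves selection it becomes nonlinear, and distributional results
for pseudooveroptimism are virtually impossible to obtain
analytically. In this case $\tilde Q^A$ and $\tilde Q^M$ must be
evaluated by Monte Carlo: $N$ independent pseudosamples
$(\tilde V_1, \tilde W_1), \ldots, (\tilde V_N, \tilde W_N)$ are
generated by the known pseudomodel (12), and for each $\tilde V_i$ the
pseudopredictor $g_W(\tilde Y_{Vi})$ is calculated. Here $X_V$ and
$X_W$ remain the same as in the observed data. By comparing
$\hat{\tilde Y}_{Wi}$ and $\tilde Y_{Wi}$, a pseudoerror
$\tilde e_{Wi}$ is calculated. This gives pseudoreplications
$\tilde L_W(g, \tilde V_i) = \frac{1}{n_W}\tilde e_{Wi}'\tilde e_{Wi}$
and $\widetilde{AL}(g, \tilde V_i) = \frac{1}{n}\tilde e_{Vi}'\tilde e_{Vi}$.
Finally, we approximate $\tilde Q^A$ and $\tilde Q^M$ by the averages

$$\frac{1}{N}\sum_{i=1}^{N}\left[\tilde L_W(g, \tilde V_i) - \widetilde{AL}(g, \tilde V_i)\right]
\quad\text{and}\quad
\frac{\frac{1}{N}\sum_{i=1}^{N}\tilde L_W(g, \tilde V_i)}{\frac{1}{N}\sum_{i=1}^{N}\widetilde{AL}(g, \tilde V_i)},$$

respectively.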
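The Monte Carlo scheme above can be sketched for a selection procedure as follows. This is a hedged illustration, not the code used in the study: it assumes NumPy, an orthonormal design with $X_W = X_V$, and a minimum-$C_p$ all-subsets search, which for orthonormal columns reduces to keeping the variables with $b_j^2 > 2\hat\sigma^2$. The names `cp_best_subset_predict` and `pseudosample_mse` are hypothetical.

```python
import numpy as np

def cp_best_subset_predict(X, y):
    """Fitted values from the minimum-C_p subset, assuming X'X = I."""
    n, k = X.shape
    b = X.T @ y                                  # OLS coefficients for orthonormal X
    sigma2_hat = (y @ y - b @ b) / (n - k)       # full-model residual mean square
    keep = b ** 2 > 2.0 * sigma2_hat             # including j lowers RSS + 2*p*sigma2_hat
    return X[:, keep] @ b[keep]

def pseudosample_mse(X, y, procedure, n_pseudo=100, seed=0):
    """Return (AL, additive estimator, multiplicative estimator), with X_W = X_V."""
    rng = np.random.default_rng(seed)
    n, k = X.shape
    beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ beta_hat
    sd_hat = np.sqrt(resid @ resid / (n - k))
    al_obs = np.mean((procedure(X, y) - y) ** 2)           # autoloss on the real data
    lw = np.empty(n_pseudo)
    al = np.empty(n_pseudo)
    for i in range(n_pseudo):
        y_v = X @ beta_hat + sd_hat * rng.standard_normal(n)   # pseudoconstruction response
        y_w = X @ beta_hat + sd_hat * rng.standard_normal(n)   # pseudotarget response
        pred = procedure(X, y_v)                                # same procedure g, rerun
        al[i] = np.mean((pred - y_v) ** 2)
        lw[i] = np.mean((pred - y_w) ** 2)
    r_add = al_obs + (lw.mean() - al.mean())
    r_mult = al_obs * lw.mean() / al.mean()
    return al_obs, r_add, r_mult

# Hypothetical data set: orthonormal design, two active variables.
rng = np.random.default_rng(2)
X, _ = np.linalg.qr(rng.standard_normal((50, 15)))    # 50 x 15 with orthonormal columns
beta = np.zeros(15)
beta[:2] = [7.0, 5.0]
y = X @ beta + rng.standard_normal(50)
al_obs, r_add, r_mult = pseudosample_mse(X, y, cp_best_subset_predict)
print(al_obs, r_add, r_mult)
```

Because the procedure is rerun on every pseudosample, the estimated overoptimism reflects the selection step as well as the fitting step.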


To illustrate the effect of subset selection on traditional estimators
and to compare these estimators with the pseudosample ones, the
following simulation study was conducted. In all the experiments the
simulated data satisfy model (1), where $X_V$ is orthonormal with
$n = 50$ rows and $k = 15, 25, 35$ columns, $\sigma^2 = 1$, and
$X_W = X_V$. As was pointed out in (Miller, 1984), the orthogonal case
gives an example of intermediate corruption of traditional estimators
under subset selection. The procedure $g$ represents the method of all
possible regressions (Hocking, 1976) and consists of screening all
$2^k$ subsets and selecting the ‘best’ one with regard to the minimum
$C_p$ (or $R^{C_p}$) criterion. The two pseudosample estimators,
$\hat R^A$ and $\hat R^M$, were compared with the two traditional
estimators: $R^{C_p}$ (17) and $J_p$ (18). Two values for the true
vector $\beta$ were considered: $\beta_1 = (0, 0, \ldots, 0)'$, which
represents the model with no significant variables, and
$\beta_2 = (7.0, 5.0, 0, \ldots, 0)'$. The second model has two very
significant predictor variables with the ‘signal to noise’ ratio
$(\beta^i)^2/\sigma^2$ being 49 and 25, respectively. The estimators
$\hat R^A$ and $\hat R^M$ were calculated by generating 100
pseudosamples for each simulated data set.

A summary of the results averaged over 400 simulated data sets is
given in Tables 1 and 2. The columns ‘CMSE’ and ‘UMSE’ report the mean
squared error of each estimator with regard to the conditional and
unconditional risk, respectively. One can see from Tables 1 and 2 that
both traditional estimators are considerably biased downward,
especially for the ratio $k/n$ close to 1 ($k = 35$), when their bias
reaches almost 50% of the actual MSEP value. The pseudosample
estimators considerably reduce bias. Moreover, contrary to the
traditional estimators, they are somewhat ‘pessimistic’, slightly
overestimating the actual MSEP. On the other hand, pseudosample
estimators have bigger variance than the traditional ones, so that
their MSE is only slightly, if at all, better than the MSE of
$R^{C_p}$, especially when we consider conditional risk. $\hat R^A$
varies a little less than $\hat R^M$ and has the lowest MSE. $J_p$ is
the worst of all the considered estimators. With regard to the
unconditional risk, both $\hat R^A$ and $\hat R^M$ outperform the
traditional estimators, having about 17-20% smaller MSE than $R^{C_p}$.

Although pseudosample estimators demonstrate somewhat better results
than the traditional ones, especially with regard to bias, they are
not very impressive. One explanation is that the considered procedure
$g$ involves a very extensive search. Thus, being very ‘nonlinear’, it
becomes very sensitive to the choice of $\hat Y^{\circ} = X\hat\beta$
for the pseudomodel (12). Although $\hat Y^{\circ}$ is an unbiased
estimator of $Y^{\circ}$,
$E\|\hat Y^{\circ}\|^2 = \|Y^{\circ}\|^2 + k\sigma^2$, so that
pseudomodel (12) is based on a response vector with a much inflated
length.

One possible way of coping with this difficulty is splitting a
comprehensive multistep procedure into subprocedures (intermediate
steps), estimating tentative results for each of them, and choosing
the final predictor based on these estimates. The reason behind this
approach is that subprocedures are less nonlinear and so may be less
sensitive to the choice of pseudomodel (12).

Table 1. Simulation Results from 400 Trials for $\beta_1 = (0, 0, \ldots, 0)'$

                       k = 15                 k = 25                 k = 35
    Estimator     Mean   CMSE   UMSE     Mean   CMSE   UMSE     Mean   CMSE   UMSE
    Actual MSEP   1.19   0      .016     1.31   0      .029     1.44   0      .042
    J_p            .90   .140   .121      .83   .315   .272      .72   .622   .553
    R^{C_p}        .91   .133   .114      .86   .281   .241      .79   .523   .462
    R^A           1.20   .124   .092     1.48   .264   .184     1.64   .502   .350
    R^M           1.30   .131   .098     1.50   .280   .202     1.67   .528   .382

Table 2. Simulation Results from 400 Trials for $\beta_2 = (7.0, 5.0, 0, \ldots, 0)'$

                       k = 15                 k = 25                 k = 35
    Estimator     Mean   CMSE   UMSE     Mean   CMSE   UMSE     Mean   CMSE   UMSE
    True MSEP     1.21   0      .016     1.33   0      .028     1.46   0      .041
    J_p            .95   .130   .111      .86   .302   .259      .75   .609   .541
    R^{C_p}        .96   .122   .103      .91   .261   .220      .81   .515   .435
    R^A           1.28   .120   .090     1.48   .258   .182     1.65   .496   .350
    R^M           1.30   .127   .096     1.50   .270   .200     1.68   .524   .381

Turning back to our example, consider subprocedures that, for each
$p = 0, 1, \ldots, k$, select the best subset with respect to $C_p$
among all possible p-subsets. The original procedure $g$ consists of
consecutively applying these subprocedures for each $p$, estimating
MSEP for each selected subpredictor, and, finally, choosing the best
one that corresponds to the minimum estimated MSEP. The results of the
corresponding simulation experiments, based on the same model
specification as above, are reported in (Kipnis, 1987). Some of them
are reproduced in Table 3, which contains the empirical average and
CMSE for the less corrupted of the two traditional estimators,
$R^{C_p}$, and the pseudosample estimator, $\hat R^A$, for each of the
selected subpredictors. One can see that $R^{C_p}$ is still
considerably biased downward when $p$ exceeds the true number of
non-zero components of $\beta$ but is less than the full size $k$.
What is even worse, $R^{C_p}$ does not follow the actual MSEP. Thus,
it has its minimum when $p = 4$ for $\beta = \beta_1$ and when $p = 6$
for $\beta = \beta_2$. On the contrary, as expected, the pseudosample
estimator $\hat R^A$ behaves here much better than in the previous
case. It not only considerably reduces bias, but also has a much
smaller MSE than $R^{C_p}$, especially when $p$ is not too close to
$k$. Moreover, it matches the actual MSEP, having its smallest values
when $p = 0$ for the first model and when $p = 2$ for the second one.

Table 4 reports the average MSEP for the final predictor selected
among all the best subpredictors according to criteria based on each
of the five statistics: actual MSEP, $J_p$, $R^{C_p}$, $\hat R^A$, and
$\hat R^M$. It follows that applying the suggested approach at the
intermediate steps of model building leads to a substantially better
final predictor than those based on traditional criteria.

Table 3. Subprocedures: Simulation Results from 400 Trials; n = 50, k = 25

$\beta_1 = (0, 0, \ldots, 0)'$

    Subset    True        R^{C_p}            R^A
    size p    MSEP     Mean    CMSE      Mean    CMSE
       0      1.00     1.01    .04       1.01    .04
       1      1.11      .94    .06       1.11    .05
       2      1.18      .91    .11       1.18    .07
       3      1.24      .89    .16       1.23    .08
       4      1.29      .881   .20       1.27    .09
       5      1.32      .882   .24       1.30    .11
       6      1.36      .89    .27       1.33    .12
       7      1.38      .90    .29       1.36    .13
       8      1.41      .92    .30       1.38    .14
       9      1.43      .94    .31       1.39    .15
      10      1.44      .96    .31       1.41    .16
      15      1.50     1.10    .27       1.45    .19
      20      1.51     1.28    .21       1.47    .20
      25      1.52     1.47    .20       1.47    .20

$\beta_2 = (7.0, 5.0, 0, 0, \ldots, 0)'$

    Subset    True        R^{C_p}            R^A
    size p    MSEP     Mean    CMSE      Mean    CMSE
       0      2.48     2.46    .15       2.46    .15
       1      1.55     1.49    .08       1.55    .00
       2      1.05     1.04    .05       1.10    .06
       3      1.15      .97    .07       1.16    .07
       4      1.22      .95    .12        --     .08
       5      1.27      .933   .16       1.26    .10
       6      1.32      .929   .20       1.50    .11
       7      1.35      .932   .23       1.33    .12
       8      1.38      .94    .26       1.36    .13
       9      1.41      .96    .27       1.38    .14
      10      1.43      .98    .28       1.40    .15
      15      1.49     1.11     --       1.45     --
      20      1.51     1.28    .21       1.47    .20
      25      1.52     1.47    .20       1.47    .20

Table 4. Average MSEP for Predictor Selected by Different Criteria

    Criterion                           True MSEP   J_p    R^{C_p}   R^A    R^M
    Ave MSEP for selected predictor
        beta = beta_1                     1.00      1.39    1.31     1.07   1.11
        beta = beta_2                     1.04      1.40    1.33     1.14   1.17

6. Conclusion. The procedural approach consists in evaluating a
selected predictor by assessing the selection procedure which has
yielded this predictor. For this purpose, it is suggested to construct
a pseudomodel, to generate a necessary number of pseudosamples, and to
apply to each of them the same selection procedure that is used for
the original data. The corresponding empirical distribution of
pseudoerrors provides all characteristics of interest. This method
appears general and flexible enough and, in principle, could be used
with any selection procedure. When the procedure is rather complex
(e.g., includes extensive search), it becomes sensitive to the choice
of an appropriate pseudomodel, which should be ‘close’ enough to the
original one. One way of coping with this problem is not to delay the
assessment of the procedure until the end of the selection process.


POSTERIOR INFLUENCE PLOTS


Robert E. Weiss, University of Minnesota


Abstract

The posterior influence plot, a graphical case influence statistic, is
introduced. It displays the entire influence of an observation on the
posterior distribution of the parameters in a statistical model. The
statistic is available for a wide class of models including, but not
restricted to, linear, nonlinear, and generalized linear regression.

1. Introduction.

Diagnostics are statistics that aid in the identification of problems
with a statistical analysis. Specific examples from linear regression
are outlier statistics, leverage values, influence statistics and
residual plots. This paper develops a graphical case influence
statistic that is available for a wide variety of models.

An influential data point has a large effect on the conclusion of an
analysis. In a Bayesian analysis, the conclusion will be the posterior
$p(\theta \mid Y)$ of the parameter vector $\theta$ given $Y$, the full
data vector. Deleting a single case from the analysis changes the
posterior from $p(\theta \mid Y)$ to $p(\theta \mid Y_{(i)})$, where
$Y_{(i)}$ denotes the reduced data vector. The problem of influential
case analysis is to compare the two densities $p(\theta \mid Y)$ and
$p(\theta \mid Y_{(i)})$.

The best solution, if possible, is to plot the two densities on the
same graph. This is feasible if $\theta$ has only one or two
dimensions, but what can be done if $\theta$ is high dimensional? This
paper finds a one or two dimensional function
$\tau_i = \tau_i(\theta, x_i)$ that encompasses all of the influence of
$y_i$ on the posterior. The posterior influence plot is a simultaneous
display of the posteriors $p(\tau_i \mid Y)$ and
$p(\tau_i \mid Y_{(i)})$, one plot for each observation.

2. An alternative Bayesian approach.

Another Bayesian approach to influence analysis is to reduce the
comparison between $p(\theta \mid Y)$ and $p(\theta \mid Y_{(i)})$ to a
one number summary, the Kullback divergence between the densities
(Johnson and Geisser 1985, 1982, 1983; Pettit and Smith 1985). The
problem with this approach is the interpretation of the resulting
numbers. How big is big? Also, several different posterior
configurations can produce the same numerical value of the influence
statistic. For example, in linear regression, a low leverage, highly
outlying observation can have the same value of an influence statistic
as a high leverage observation which is not outlying. Which
configuration is the actual cause? The best influence statistic would
be a plot of $p(\theta \mid Y)$ and $p(\theta \mid Y_{(i)})$ for all
$\theta$ in $R^k$, but this plot is difficult to draw in general. A
good graphic should, however, capture much more information than just
a single number, so in this paper we attempt to find a low dimensional
function of $\theta$ that captures all of the influence of the $i$th
observation on the posterior.

The low dimensional function of $\theta$ can be found by inspecting
the sampling density $f(y_i \mid \theta, x_i)$ of the $i$th observation
$y_i$, where $x_i$ denotes the independent variables. In most models,
the sampling density depends only on a low dimensional function
$\tau_i(\theta, x_i)$, and the observation can influence the posterior
of $\theta$ only through its influence on the posterior of $\tau_i$.

This function depends on the particular case and on the model. For
example, in linear regression with known variance,

$$\tau_i(\theta, x_i) = x_i^T\theta$$

is just one dimensional. Knowing $\tau_i$ determines the sampling
density of the $i$th observation. In linear regression with unknown
variance,

$$\tau_i(\theta, x_i) = (x_i^T\theta, \sigma)$$

is two dimensional. In nonlinear regression with mean function
$E[y_i] = \eta(\theta, x_i)$ and unknown variance, $\tau_i$ is

$$\tau_i(\theta, x_i) = (\eta(\theta, x_i), \sigma).$$

For generalized linear models (McCullagh and Nelder 1983) with link
function $g(\cdot)$,

$$\tau_i(\theta) = g(x_i^T\theta).$$

These models cover many of the models in use in statistics.

In section 3 a downdating version of Bayes theorem is presented and
also a marginal version of Bayes theorem. The statement that the
observation influences $p(\theta \mid Y)$ only through
$p(\tau_i \mid Y)$ is proved. In section 4 the Kullback divergence is
introduced as an influence statistic, and a proof is given that the
Kullback divergence between the $\theta$ posteriors for the $i$th case
depends only on the function $\tau_i(\theta, x_i)$. This provides a
second proof that $y_i$ only influences $p(\theta \mid Y)$ through
$p(\tau_i \mid Y)$. Finally a nonlinear regression example is given in
section 5, followed by discussion in section 6.


3. A downdating version of Bayes theorem.

A downdating form of Bayes theorem is

$$p(\theta \mid Y) = \frac{p(\theta \mid Y_{(i)})\, f(y_i \mid \theta, x_i)}{f(y_i \mid Y_{(i)}, x_i)} \qquad (1)$$

where $f(y_i \mid Y_{(i)}, x_i)$ is the numerator integrated over the
range of $\theta$. Equation (1) can be used in either of two
directions: to update the posterior after new data arrives, or to
remove an observation for purposes of sensitivity or influence
analysis.
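The downdating direction of (1) can be illustrated numerically on a grid. The sketch below is an assumption-laden toy example rather than the paper's computation: it uses NumPy, a normal mean with known standard deviation, a flat prior on a bounded grid, and made-up data; the function names are hypothetical.

```python
import numpy as np

def normal_pdf(y, mean, sd):
    """Normal density, used here as the sampling density f(y_i | theta)."""
    return np.exp(-0.5 * ((y - mean) / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))

def grid_posterior(y, theta, sd):
    """Posterior of a normal mean on a uniform grid under a flat prior."""
    log_post = np.sum(-0.5 * ((y[:, None] - theta[None, :]) / sd) ** 2, axis=0)
    post = np.exp(log_post - log_post.max())             # stabilise before normalising
    return post / (post.sum() * (theta[1] - theta[0]))   # Riemann-sum normalisation

def downdate(post_full, y_i, theta, sd):
    """Remove case i: p(theta | Y_(i)) is proportional to p(theta | Y) / f(y_i | theta)."""
    unnorm = post_full / normal_pdf(y_i, theta, sd)
    return unnorm / (unnorm.sum() * (theta[1] - theta[0]))

# Made-up data; the fourth value plays the role of a suspicious case.
y = np.array([4.1, 5.3, 4.8, 9.0, 5.0])
theta = np.linspace(0.0, 12.0, 1201)
sd = 1.0
full = grid_posterior(y, theta, sd)
reduced = downdate(full, y[3], theta, sd)              # downdated posterior
direct = grid_posterior(np.delete(y, 3), theta, sd)    # posterior refitted without the case
print(np.max(np.abs(reduced - direct)))
```

Dividing the full posterior by the sampling density of the deleted case and renormalising reproduces the refitted posterior, so no second fit is needed.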

Change variables in (1) from $\theta$ to $(\tau_i, \rho_i)$, where
$\tau_i$ is the function such that the sampling distribution
$f(y_i \mid \theta, x_i)$ is equal to $f(y_i \mid \tau_i(\theta, x_i))$,
and $\rho_i$ is chosen to make the transformation one to one. The
posterior $p(\theta \mid Y)$ can be written as
$p(\tau_i \mid Y)\, p(\rho_i \mid Y, \tau_i)$, and similarly for the
reduced data posterior. Then (1) can be rewritten

$$p(\tau_i \mid Y)\, p(\rho_i \mid Y, \tau_i) = \frac{p(\tau_i \mid Y_{(i)})\, p(\rho_i \mid Y_{(i)}, \tau_i)\, f(y_i \mid \tau_i)}{f(y_i \mid Y_{(i)}, x_i)}. \qquad (2)$$

The extra parameter $\rho_i$ only occurs once on each side of (2) and
can be integrated out, giving a marginal Bayes theorem for the
$\tau_i$ parameter,

$$p(\tau_i \mid Y) = \frac{p(\tau_i \mid Y_{(i)})\, f(y_i \mid \tau_i)}{f(y_i \mid Y_{(i)}, x_i)}. \qquad (3)$$

Dividing (2) by (3) gives

$$p(\rho_i \mid Y, \tau_i) = p(\rho_i \mid Y_{(i)}, \tau_i). \qquad (4)$$

Equation (4) says that the $i$th observation has no effect on the
conditional distribution of $\rho_i$ given $\tau_i$. Thus $y_i$ only
has an effect on $\tau_i$; the posterior of $\rho_i$ given $\tau_i$
does not depend on whether $y_i$ is in the analysis. Rearranging (1)
and (3) gives

$$\frac{p(\theta \mid Y_{(i)})}{p(\theta \mid Y)} = \frac{f(y_i \mid Y_{(i)}, x_i)}{f(y_i \mid \tau_i(\theta, x_i))} = \frac{p(\tau_i \mid Y_{(i)})}{p(\tau_i \mid Y)}. \qquad (5)$$

The equality of the outer two ratios will be useful in the analysis of
the Kullback influence statistic.


4. The Kullback divergence influence statistic.

The Kullback divergence between the full posterior and the reduced
data posterior,

$$K_{(i)}(\theta) = \int \log\left[\frac{p(\theta \mid Y_{(i)})}{p(\theta \mid Y)}\right] p(\theta \mid Y_{(i)})\, d\theta, \qquad (6)$$

is a useful generic measure of the influence of the $i$th observation
on the posterior of $\theta$. (See McCulloch (1985) and Bernardo (1985,
1979).) The notation $K_{(i)}(\theta)$ is shorthand to say that
$K_{(i)}$ measures the influence of the $i$th case on the posterior of
$\theta$. The case statistic $K_{(i)}$ can be conveniently and cheaply
computed numerically provided the observations are conditionally
independent given the parameter vector.

Equation (6) can be simplified by using equation (5) to substitute for
the posterior ratio inside the log, changing variables from $\theta$ to
$(\tau_i, \rho_i)$ and integrating out $\rho_i$, producing

$$K_{(i)}(\theta) = \int \log\left[\frac{p(\tau_i \mid Y_{(i)})}{p(\tau_i \mid Y)}\right] p(\tau_i \mid Y_{(i)})\, d\tau_i \equiv K_{(i)}(\tau_i). \qquad (7)$$

Figure 1.

The Kullback divergence between the reduced data and full data
$\theta$ posteriors is equal to the Kullback divergence between the
reduced data and full data $\tau_i$ posteriors!
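When $\tau_i$ is one dimensional, the integral in (7) can be approximated on a grid. The following sketch is illustrative only: it assumes NumPy and uses two made-up normal densities in place of actual full data and reduced data posteriors, so that the result can be compared with the known closed form for normal densities; the function name `kullback_influence` is hypothetical.

```python
import numpy as np

def kullback_influence(p_reduced, p_full, grid):
    """Approximate the integral of log(p_reduced / p_full) * p_reduced on a uniform grid."""
    dx = grid[1] - grid[0]
    mask = p_reduced > 0                      # points with zero mass contribute nothing
    integrand = np.zeros_like(p_reduced)
    integrand[mask] = p_reduced[mask] * np.log(p_reduced[mask] / p_full[mask])
    return integrand.sum() * dx

def normal_pdf(x, mean, sd):
    return np.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))

# Stand-in densities (assumed): deleting the case shifts and widens the posterior.
grid = np.linspace(-12.0, 12.0, 24001)
m_full, s_full = 0.0, 1.0
m_red, s_red = 0.5, 1.2
k_numeric = kullback_influence(normal_pdf(grid, m_red, s_red),
                               normal_pdf(grid, m_full, s_full), grid)
# Closed form of the divergence between two normal densities, for comparison.
k_exact = (np.log(s_full / s_red)
           + (s_red ** 2 + (m_red - m_full) ** 2) / (2.0 * s_full ** 2) - 0.5)
print(k_numeric, k_exact)
```

The single number summarises both the location shift and the change in precision, which is why two quite different pairs of densities can share the same value.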

Equation (7) depends on the two posteriors $p(\tau_i \mid Y_{(i)})$ and
$p(\tau_i \mid Y)$, which for a wide variety of models will be
densities on $R^1$ or $R^2$. The plot of these two densities will
exhibit all of the influence of the $i$th point on $\theta$, since the
conditional distribution of $\rho_i$ given $\tau_i$ does not depend on
the outcome of the $i$th case.


5. An example: Bean Root Cell data.

Table 1 is a list of the data and Figure 1 is a plot of the bean root
cell data from Ratkowsky's (1983, p. 88) book on nonlinear regression
modeling. The response, y, is water content in 10-® g (Heyes and Brown
1956), plotted against the independent variable x, distance in
millimeters from the growing tip. A normal theory nonlinear regression
model with independent errors was used by Ratkowsky to analyze these
data. The mean function is the logistic function

$$E[y_i \mid \theta, x_i] = \frac{\theta_1}{1 + \exp(\theta_2 - \theta_3 x_i)} \qquad (8A)$$

and the variance is assumed constant and known,

$$\mathrm{Var}[y_i] = .414016. \qquad (8B)$$

The Kullback statistics, $K_{(i)}$, were computed for this model using
an algorithm based on the iterative Gauss-Hermite quadrature
methodology of Naylor and Smith (1982).

[Plot title: Bean Root Data]

For the model (8),
$\tau_i(\theta, x_i) = \theta_1/(1 + \exp(\theta_2 - \theta_3 x_i))$
is a one dimensional parameter and is equal to the sampling mean of
the $i$th observation. That $\tau_i$ can be put onto a scale with a
physical interpretation is typical of linear, nonlinear and
generalized linear models. For this particular nonlinear model,
$\tau_i$ is length in the same units as the measurements $y_i$, and
consequently the data analyst can use subject matter knowledge to help
decide if the observation is having a substantial impact on the
inference. Because of the statistical information contained in the
plot, the statistician can decide if there are statistical reasons for
considering an observation to be highly influential.

Figures 2 through 5 show four examples of posterior influence plots
for the bean root cell data. The influence plots are for observations
8, 6, 14, and 1, respectively. These points have the highest, the
second highest, the median, and a very small value of the Kullback
influence statistic, respectively. The x-axis is water content in
units of 10-® g and each plot has the same scale along the x-axis, to
facilitate visual comparisons among the pictures. Figures 2, 3, and 4
also have the same scale on the y-axis, while Figure 5 has a scale 4
times the others to accommodate its large peak.

In each picture, the solid line is the full data posterior marginal
$p(\tau_i \mid Y)$, while the dotted line is the reduced data posterior
marginal $p(\tau_i \mid Y_{(i)})$.

Figure 3.

Figure 4.

The most influential observation according to the Kullback statistic
is case number 8, with a value of $K_{(i)}$ 2.7 times larger than the
next largest value. Figure 2 shows that the mode of the posterior
decreases by approximately .6 units when the observation is deleted.
That the precision in the posterior decreases is also visible because
the mode height decreases from approximately 1.4 to 1.2.

Figure  3  shows  that  omitting  the 
observation  number  6  moves  probability 
from  lower  values  of  water  content  to 
higher  values  of  water  content.  Deleting  the 
observation  also  decreases  the  precision  of 
our  posterior. 

The next posterior influence plot, Figure 4, shows only a moderate
influence by observation 14. There is a mild location shift and a mild
change in precision.

[Posterior influence plot, case 14; x-axis: Water Content]

Figure 5.

Figure  5  shows  that  omitting  the  first 
observation  has  virtually  no  impact  on  the 
posterior.  This  picture  also  has  a  very 
narrow,  sharply  peaked  density,  compared 
with  the  other  pictures.  Given  the  rest  of 
the  data,  the  location  of  this  observation  is 
quite  well  determined  and  consequently  it 
has  low  influence. 

6.  Discussion. 

There is one posterior influence plot per observation. Rather than drawing every picture, some influence statistic such as K(i) or the $L_1$ norm between $p(\theta \mid Y)$ and $p(\theta \mid Y_{(i)})$ can be used to select the most interesting pictures. The analyst looks at the posterior influence plots for observations with the largest values of the selector statistic. If, for example, the plot corresponding to the most influential point does not show a worrisome amount of influence, then no further plots need be looked at.

The posterior influence plot covers all differences between the high-dimensional posteriors $p(\theta \mid Y)$ and $p(\theta \mid Y_{(i)})$, as shown in equation (4). Another interpretation and proof of this statement is as follows.

Define the g-influence of $y_i$ on a function $\beta(\theta)$ as

\[ I_g(y_i \text{ on } \beta(\theta)) = \int g\!\left( \frac{p(\beta(\theta) \mid Y)}{p(\beta(\theta) \mid Y_{(i)})} \right) p(\beta(\theta) \mid Y_{(i)}) \, d\beta(\theta), \qquad (9) \]

where $p(\beta(\theta) \mid Y)$ is the posterior distribution of $\beta(\theta)$ given the data. Then it can be shown that

\[ I_g(y_i \text{ on } \theta) = I_g(y_i \text{ on } \tau_i(\theta)) \qquad (10) \]

for any measurable g. In a strong sense, the posterior influence plot loses no information about the influence of $y_i$ on the posterior $p(\theta \mid Y_{(i)})$. A statistically interesting example of (9) is the $L_1$ norm between the full data and reduced data posteriors.

When the parameter $\tau_i(\theta, x_i)$ is two dimensional, then two two-dimensional densities $p(\tau_i \mid Y)$ and $p(\tau_i \mid Y_{(i)})$ need to be compared. Pairs of static contour plots may not be satisfactory, and future work will include the development of a system for looking at pairs of two dimensional densities.

Other work is needed to find good low dimensional projections of $p(\theta \mid Y)$ and $p(\theta \mid Y_{(i)})$ when $\tau_i$ is more than two dimensions. Examples where this is interesting are for sets of influential observations, and for multivariate observations.


7.  Conclusion. 

The posterior influence plot for a particular observation is a graph of the posteriors $p(\tau_i \mid Y)$ and $p(\tau_i \mid Y_{(i)})$. The function $\tau_i$ is chosen as the lowest dimensional function of the parameters $\theta$ that determines the sampling distribution of the ith observation. The posterior influence plot captures all of the influence of $y_i$ on $p(\theta \mid Y)$, since the marginal densities $p(\beta \mid Y, \tau_i) = p(\beta \mid Y_{(i)}, \tau_i)$ for any function $\beta$ of the parameters. Posterior influence plots are possible in principle for any statistical model, and are practicable for a wide variety of useful statistical models.

Acknowledgements

Thanks  to  Dennis  Cook  for  many 
conversations  and  advice.  Thanks  to  Kathryn 
Chaloner  for  proofreading  this  paper. 

REFERENCES. 

Bernardo, J.M. (1979). Expected information as expected utility. Annals of Statistics, 7, 686-690.

Bernardo, J.M. (1985). Comment on Pettit and Smith's article. In Bayesian Statistics 2, J.M. Bernardo et al., eds., pp. 492-493. Amsterdam: North Holland.

Heyes, J.K. and Brown, R. (1956). Growth and cellular differentiation. In F.L. Milthorpe (Ed.), The Growth of Leaves. Butterworth, London.

Johnson, W. and Geisser, S. (1982). Assessing the predictive influence of observations. In Statistics and Probability: Essays in Honor of C.R. Rao, Kallianpur, Krishnaiah, and Ghosh, eds., 353-358. Amsterdam: North Holland.

Johnson, W. and Geisser, S. (1983). A predictive view of the detection and characterization of influential observations in regression analysis. Journal of the American Statistical Association, 78, 137-144.

Johnson, W. and Geisser, S. (1985). Estimative influence measures for the multivariate general linear model. Journal of Statistical Planning and Inference, 11, 33-56.

McCullagh, P. and Nelder, J.A. (1983). Generalized Linear Models. Chapman and Hall.

McCulloch, R.E. (1985). Model Influence in Bayesian Statistics. Unpublished PhD Dissertation, University of Minnesota.

Pettit, A.N. and Smith, A.F.M. (1985). Outliers and influential observations in linear models. In Bayesian Statistics 2, J.M. Bernardo et al., eds., 473-494. Amsterdam: North Holland.

Ratkowsky, D.A. (1963). Nonlinear Regression Modeling. Marcel Dekker.


EXACT POWER CALCULATIONS FOR THE CHI-SQUARE TEST OF TWO PROPORTIONS

Carl E. Pierchala, U.S. Department of Agriculture


ABSTRACT 


Approximations are often used when calculating the power of the Pearson Chi-Square test of two independent proportions. This speeds up the computations and simplifies programming. At times, however, it is useful to directly compute the exact power. For example, one may wish to assess an approximation's adequacy in a specific situation. Thus, an APL program was developed to do exact power calculations on an IBM PC/XT. It gives accurate and reasonably fast computations. The exact power values for certain circumstances are compared to the corresponding values obtained using two approximations, one of which is based on the arc sine transformation. It is seen that these approximations are quite inaccurate in some situations.


KEYWORDS:  Pearson  Chi-Square  Test,  two-by-two 
tables,  proportions,  power,  APL,  Personal 
Computers,  adverse  events,  arc  sine  transformation 


1. INTRODUCTION

In a clinical trial being planned to compare a placebo group with an active-treatment group, there was concern that a rare but serious adverse event would be more likely to occur with the active treatment. It was anticipated that the (uncorrected) Pearson Chi-Square test would be used to test the null hypothesis of no difference in proportions of individuals suffering the adverse event. However, the question arose as to whether the study would have sufficient power using this test to detect a several-fold increase in the adverse event rate in the active-treatment group.

Many authorities (e.g., Cohen, 1977; Brownlee, 1965) recommend using the "arc sine" transformation to compute approximate power for a test of equality of two independent proportions. Note that while the test for equality of proportions is often given as a z-test, it follows by an argument analogous to that of Fleiss (1973) that the uncorrected version of the z-test is equivalent to Pearson's test.

Two approximations to the power of a test of equality of two proportions were available in an APL library at the FDA's Center for Drug Evaluation and Research, where the problem motivating this paper first arose. One version (POWHALD), based on the arc sine transformation, was attributed to Hald (1952, pp. 705 ff). The other (POWHSU) was not well documented; it was described as similar to the Hald version, but without an arc sine transformation. Thus, in a preliminary attempt to evaluate the power of the test for the situation at hand, some computations were done using both approximations.

The work reported in this paper was begun while the author was employed by the Food and Drug Administration, and was continued after the author joined the United States Department of Agriculture. The views expressed in this paper are those of the author, and not necessarily those of either the Food and Drug Administration or the United States Department of Agriculture.

It was assumed that a one-sided, nominal 5% level test would be used, and that both samples would be of size 375. In the placebo group, the probability of the occurrence of the adverse event was assumed to be .001. In the treated group, the probability of the adverse event was varied from .001 to .020. Unfortunately, as will be seen in more detail below, the two approximations did not always give very similar values for approximate power. For example, when a treated patient was assumed to have probability .010 of having the adverse event, the Hald approximation gave .59 for the power, while the alternate approximation gave .51. Thus, there was a question as to which, if either, of the two approximations was better.

In addition, it was noted that for an N of 375 and a P of .001, NP equals .375. This is much smaller than 5, which is a conventional criterion for deciding if it is appropriate to do a Chi-Square test (Brownlee, 1965, p. 153). By implication, this led to the question as to whether the arc sine transformation would provide an adequate approximation to the power in such a situation.

To answer these questions, it seemed desirable to attempt to do an exact calculation of the power of Pearson's Chi-Square test. Upon reflection, it became clear that this is conceptually fairly easy to do under a conventional probability model. Development of an APL function to do the computations was thus undertaken. This paper gives a progress report, and reports some computations that shed light on the questions raised above.


2.  THEORETICAL  BACKGROUND 

In this section, the theory behind the computation of the power of the Pearson Chi-Square test is reviewed. The notation used in this paper mimics that used in the code in the APL function that does the computations.

Suppose we observe $N_1$ identically and independently distributed (IID) Bernoulli random variables from one population, and $N_2$ such variables from a second population. That is,

\[ X_{11}, X_{12}, \ldots, X_{1N_1} \text{ are IID } B(1, P_1), \]

and

\[ X_{21}, X_{22}, \ldots, X_{2N_2} \text{ are IID } B(1, P_2), \]

with $X_{1j}$ independent of $X_{2k}$. Let $S_i$ denote the number of successes in sample $i$, and $\hat{p}_i = S_i/N_i$ the sample proportion, for $i = 1, 2$. The results could be displayed as follows.

POPULATION   SAMPLE SIZE   NUMBER OF SUCCESSES   SAMPLE PROPORTION
    1           N1                S1                  S1/N1
    2           N2                S2                  S2/N2

It follows from the suppositions above that $S_1$ and $S_2$ are statistically independent binomial random variates. That is, $S_i$ is $B(N_i, P_i)$, for $i = 1, 2$.

Note also that the results could be displayed in a conventional two-by-two table as follows.

                SUCCESSES    FAILURES           TOTALS
POPULATION 1    S1           N1-S1              N1
POPULATION 2    S2           N2-S2              N2
TOTALS          S1+S2        N1+N2-(S1+S2)      N1+N2

Now we may be interested either in a one-sided test,

\[ H_0: P_1 \ge P_2 \quad \text{vs.} \quad H_1: P_1 < P_2, \]

or a two-sided test,

\[ H_0: P_1 = P_2 \quad \text{vs.} \quad H_1: P_1 \ne P_2. \]

In either case we will compute Pearson's Chi-Square statistic,

\[ X^2 = \frac{(N_1+N_2)\,[S_1(N_2-S_2) - S_2(N_1-S_1)]^2}{N_1 N_2 (S_1+S_2)\,[N_1+N_2-(S_1+S_2)]}. \]

Let $\alpha$ denote the nominal significance level of the test, and let $\chi^2_{q}$ denote the $q$ quantile of the chi-square distribution with one degree of freedom. The one-sided test is significant if both

\[ \hat{p}_1 < \hat{p}_2 \quad \text{and} \quad X^2 > \chi^2_{1-2\alpha}. \]

The two-sided test is significant if

\[ X^2 > \chi^2_{1-\alpha}. \]

Note the difference in critical values.

Now the sample space $\Omega$ consists of points $(S_1, S_2)$, where $S_1 = 0, 1, 2, \ldots, N_1$ and $S_2 = 0, 1, 2, \ldots, N_2$. Let $R_1$ and $R_2$ denote the rejection regions (i.e., the subsets of points in the sample space for which the tests are significant) for the 1-sided and 2-sided tests, respectively.

The probability of any $(S_1, S_2)$ is

\[ P_{S_1,S_2} = P_{S_1} \cdot P_{S_2}, \]

where

\[ P_{S_i} = \frac{N_i!}{S_i!\,(N_i-S_i)!}\, P_i^{S_i} (1-P_i)^{N_i-S_i} \]

is a binomial probability, $S_i = 0, 1, 2, \ldots, N_i$, for $i = 1, 2$.

Now the power of the one-sided test is given by

\[ W^{(1)} = \sum P_{S_1,S_2}, \]

where the sum is taken over $(S_1, S_2)$ in $R_1$, while the power of the two-sided test is given by

\[ W^{(2)} = \sum P_{S_1,S_2}, \]

where the sum is taken over $(S_1, S_2)$ in $R_2$.

In summary, conceptually it is easy to calculate the power. However, in practice there are problems due to the large number of computations and decisions as to which points in the sample space are in the rejection region.

3.  COMPUTATIONAL  APPROACH 

Now it is to be noted that the sample space is a lattice of points, that is, a rectangular array. Thus for each point in the sample space, the corresponding element of an appropriate matrix can store the probability of each $(S_1, S_2)$, its $X^2$ value, and its membership in either $R_1$ or $R_2$.

Using the APL programming language, one can readily do appropriate matrix computations and form the appropriate sums to obtain the power.
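
The lattice computation described above can be sketched in a few lines. The following is a minimal Python illustration, not the author's APL function BINPOWR: the names `binom_pmf` and `exact_power` are hypothetical, the full sample space is enumerated with no truncation, and the one-sided critical value is taken as the squared upper-alpha normal quantile, which is the chi-square (1 d.f.) critical value at level 2*alpha.

```python
from math import exp, lgamma, log
from statistics import NormalDist


def binom_pmf(n, p):
    """Binomial probabilities for s = 0..n, computed on the log scale.

    Assumes 0 < p < 1 (hypothetical helper, not part of the paper).
    """
    out = []
    for s in range(n + 1):
        log_coef = lgamma(n + 1) - lgamma(s + 1) - lgamma(n - s + 1)
        out.append(exp(log_coef + s * log(p) + (n - s) * log(1 - p)))
    return out


def exact_power(n1, n2, p1, p2, alpha=0.05, one_sided=True):
    """Sum P(S1) * P(S2) over the rejection region of the uncorrected test."""
    # Chi-square (1 d.f.) critical value expressed through a normal quantile.
    z = NormalDist().inv_cdf(1 - alpha if one_sided else 1 - alpha / 2)
    crit = z * z
    pmf1, pmf2 = binom_pmf(n1, p1), binom_pmf(n2, p2)
    power = 0.0
    for s1 in range(n1 + 1):
        for s2 in range(n2 + 1):
            t = s1 + s2
            if t == 0 or t == n1 + n2:
                continue  # statistic is 0/0 here; treated as not significant
            if one_sided and not s1 * n2 < s2 * n1:
                continue  # one-sided test also requires S1/N1 < S2/N2
            d = s1 * (n2 - s2) - s2 * (n1 - s1)
            x2 = (n1 + n2) * d * d / (n1 * n2 * t * (n1 + n2 - t))
            if x2 > crit:
                power += pmf1[s1] * pmf2[s2]
    return power


null_power = exact_power(375, 375, 0.001, 0.001)
alt_power = exact_power(375, 375, 0.001, 0.010)
```

With the settings of the motivating example (both samples of size 375, placebo rate .001), the sketch can be compared against the exact column of Table 1. The treatment of the degenerate points with no successes or no failures is an assumption of this sketch.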

Furthermore, one can speed up the computations by ignoring points in the sample space having trivially small probabilities. Basically, the idea is that there is a rectangular subspace of the sample space "centered" about the expected value $(N_1 P_1, N_2 P_2)$ of the random variable $(S_1, S_2)$, such that for points outside the subspace the cumulative probability is negligibly small. Thus such points can be ignored, so the computations are speeded up by only performing them over the subspace. All that needs to be done is to determine the boundaries of this rectangle, that is, the "marginal subranges" of the $S_1$ and the $S_2$ for which the various matrix computations are to be done.

Two different strategies were experimented with for determining these marginal subranges. The approach currently being used is to calculate the $P_{S_i}$, for $S_i = 0, 1, \ldots, N_i$, and to only utilize those $S_i$ for which $P_{S_i} > E/N_i$, for some small value, E. A bit of thought shows that the probability of not being in the rectangular subspace is less than 2E, which is thus a conservative upper bound on the error in the computed power.
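
The E/N rule can be illustrated with a short Python sketch. The function name `marginal_subrange` and the choice E = 1e-6 are assumptions made here for illustration only. The reasoning behind the bound: at most N of the N+1 values can be excluded from one margin, each with probability at most E/N, so the excluded mass per margin is at most E and the mass outside the rectangle is below 2E.

```python
from math import exp, lgamma, log


def marginal_subrange(n, p, eps):
    """Return (lo, hi, excluded_mass) for the values s with P(S = s) > eps / n.

    Assumes 0 < p < 1; eps plays the role of the small value E in the text.
    """
    pmf = []
    for s in range(n + 1):
        log_coef = lgamma(n + 1) - lgamma(s + 1) - lgamma(n - s + 1)
        pmf.append(exp(log_coef + s * log(p) + (n - s) * log(1 - p)))
    kept = [s for s, prob in enumerate(pmf) if prob > eps / n]
    # Total probability of the values that were dropped from this margin.
    excluded = sum(prob for prob in pmf if prob <= eps / n)
    return kept[0], kept[-1], excluded


lo, hi, excluded = marginal_subrange(375, 0.001, 1e-6)
```

For a rare event the retained subrange is a small fraction of the 376 possible values, which is where the saving in computation comes from.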

An earlier strategy which was abandoned was a kind of normal approximation. This involved using $S_i$ in the range $[N_i P_i \pm k (N_i P_i (1-P_i))^{1/2}]$, where k is an appropriate constant. However, although this approach to determining the marginal subranges proved to be computationally quick, it was not always accurate even with k as large as 5 or 6, in particular for small $P_i$.

Another technical point is about the computation of the binomial probabilities. This is done in part with a function named LNFAC from the FDA's APL library. This function calculates the base e logarithm of N! (N factorial). Using logarithms and LNFAC, one readily obtains $\log P_{S_i}$ and then exponentiates to obtain $P_{S_i}$.

This indirect approach to calculating the binomial probabilities is slower than the more obvious direct calculation of binomial coefficients multiplied by powers of the appropriate probabilities. However, the APL function that calculates N! won't work for N larger than 170, so when either of the $N_i$ exceeds 170, the direct approach cannot be used. Using the indirect approach, accurate computations can be done for $N_i$ larger than 170, making the routine more versatile.
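
The limit of 170 reflects double-precision floating point, in which 171! overflows. A small Python sketch shows the same effect and the logarithmic workaround. Here `ln_fac` is only a hypothetical stand-in for LNFAC, built on the standard library's `lgamma`; the APL function's actual implementation is not described in the paper.

```python
from math import comb, exp, factorial, lgamma, log


def ln_fac(n):
    """Natural logarithm of n!, via the log-gamma function."""
    return lgamma(n + 1)


def binom_prob_log(n, s, p):
    """Binomial probability computed indirectly, on the log scale."""
    log_coef = ln_fac(n) - ln_fac(s) - ln_fac(n - s)
    return exp(log_coef + s * log(p) + (n - s) * log(1 - p))


def binom_prob_direct(n, s, p):
    """Direct calculation: binomial coefficient times powers of p and 1 - p."""
    return comb(n, s) * p ** s * (1 - p) ** (n - s)


# 170! still fits in a double; 171! does not.
fits_170 = float(factorial(170))
try:
    float(factorial(171))
    overflowed = False
except OverflowError:
    overflowed = True
```

Python integers are unbounded, so the direct route would still work here; the overflow only appears once the factorial is forced into floating point, as it is in the APL setting the paper describes.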

With regard to the calculation of the matrix of chi-square values, which are used in determining the critical region for the test, the computational formula given above is used in conjunction with APL's matrix capabilities to yield the matrix. The APL code to do so is done in stages in several lines rather than one line, thus reducing the need for temporary storage of intermediate results. This matrix approach proved to be much faster than an earlier approach in which looping was used in combination with a function that computes $\sum (o-e)^2/e$ for a four-fold table.

The function BINPOWR that does the computations was programmed using Version 6.4 of STSC Inc.'s APL*PLUS on an IBM PC-XT. A general flow chart of the computational procedure is given in Figure 1.


4.  RESULTS 


To date, a version of the function BINPOWR has been developed which is accurate and is fairly fast. Some speed results are given in the text below. In developing the routine, extensive checks on the accuracy were made. Power values were calculated for certain simple cases, including the example given by Conover (1971, p. 146). In all such checks, the function gave the correct answer. In addition, Garside and Mack (1976) give exact probabilities of rejecting the null hypothesis of the equality of two proportions. This varies as a function of the common proportion assumed in the computation. Their values could be verified by using BINPOWR with $P_1$ equal to $P_2$. In doing so, it was noted that Garside and Mack performed their computations for a one-sided test. At any rate, a variety of Garside and Mack's computations were checked, and in every such case the value computed using BINPOWR agreed with their value to the precision they reported.


However, there are still some problems with the function. First, it is not readily usable as a function. It needs to be made more "user-friendly". Second, the computation of the marginal subranges seems slow. Possibly an approach to speed this up can be found. Third, no estimate is given for the error in the reported exact power value due to restricting the computations to the subspace determined by the marginal subranges. Such an estimate could readily be calculated and returned by the function. Finally, the code for calculating chi-square can be adjusted very simply to include various kinds of continuity corrections. It would be relatively simple and highly desirable to add such a feature to the routine. Thus, BINPOWR is still being refined, and is not yet available for distribution.

FIGURE 1. Flow Chart for the Computations.

The results alluded to in the introduction are given in Table 1. They were computed using an IBM PC-XT with an 8087 math coprocessor chip and 640K memory. With the particular parameters used in this table, approximately 92 to 93 seconds were required for each power value computed using
BINPOWR,  In  the  computations,  the  value  E  ■  10 
was used. Thus, the values in the table are correct to the reported number of digits.

For the set of $P_2$ values in Table 1, it can be seen that the Hald approximation to the power at least equals and usually exceeds that of the alternate approximation. The difference in nominal power values between the two approximations is greater than .05 when $P_2$ either is .020 or is in the range from .008 to .012. The difference is greater than .10 when $P_2$ is in the range from .014 to .018.
TABLE 1. POWER OF THE 1-SIDED, NOMINAL 5% LEVEL PEARSON CHI-SQUARE TEST OF THE EQUALITY OF TWO PROPORTIONS, AS A FUNCTION OF P2, FOR P1 = .001 AND N1 = N2 = 375.

                    COMPUTATIONAL METHOD
          HALD             ALTERNATE        EXACT
  P2      APPROXIMATION    APPROXIMATION    COMPUTATION

 .001       .0500            .0500            .0045
 .002*      .0992            .0984            .0278
 .004       .2183            .2056            .1323
 .006       .3492            .3140            .2759
 .008       .4768            .4165            .4200
 .010       .5918            .5098            .5474
 .012       .6897            .5925            .6539
 .014       .7694            .6643            .7403
 .016       .8321            .7258            .8088
 .018       .8799            .7776            .8621
 .020       .9154            .8209            .9024

* Note change in increment

The exact power values in Table 1 are often substantially different from those given by the approximations. Making comparisons across each row of the table, the following is seen. For small $P_2$, the exact power values are substantially below the corresponding values from the approximations; this is most striking for $P_2 = .001$, where the exact value of .0045 is an order of magnitude smaller than that of .0500 given by each approximation. On the other hand, for moderate $P_2$ (from .006 to .010), the alternate approximation is within .05 of the exact value. For larger $P_2$ (from .010 to .020), the Hald approximation is within .05 of the exact value. Finally, it can be seen that the alternate approximation gives a value that is higher than the exact power value when $P_2$ is less than .008; otherwise, the alternative gives a lower value. On the other hand, the Hald approximation always gives a higher value than the exact power value.
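
The two approximate columns of Table 1 can be checked with a short Python sketch. The arc sine form below is the standard normal approximation on the variance-stabilized scale. The second function is only a guess at the undocumented POWHSU routine: an unpooled-variance z approximation is assumed here because it happens to reproduce the tabled values, not because the paper states the formula. The function names are hypothetical.

```python
from math import asin, sqrt
from statistics import NormalDist

_NORM = NormalDist()


def power_arcsine(p1, p2, n, alpha=0.05):
    """One-sided approximate power using the arc sine transformation."""
    h = 2 * asin(sqrt(p2)) - 2 * asin(sqrt(p1))
    return _NORM.cdf(h * sqrt(n / 2) - _NORM.inv_cdf(1 - alpha))


def power_unpooled(p1, p2, n, alpha=0.05):
    """One-sided approximate power without the transformation.

    Assumed form: (p2 - p1) divided by its unpooled standard error.
    """
    se = sqrt((p1 * (1 - p1) + p2 * (1 - p2)) / n)
    return _NORM.cdf((p2 - p1) / se - _NORM.inv_cdf(1 - alpha))


hald_010 = power_arcsine(0.001, 0.010, 375)
alt_010 = power_unpooled(0.001, 0.010, 375)
```

Both functions return the nominal level .05 when the two proportions are equal, which is exactly where the exact computation in Table 1 gives .0045.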


5.  CONCLUSIONS 

A few conclusions can be drawn at this point in the development of BINPOWR. First, STSC's APL on an IBM PC-XT gives accurate, relatively quick exact power computations. This can be useful at the very least for spot checking approximations, which are seen to be inaccurate in some cases. This inaccuracy was particularly extreme in the case where the null hypothesis is true and the common proportion is relatively small. Finally, it is to be noted that APL programming is quite time consuming for the novice APL programmer.
On Covariances of Marginally Adjusted Data


James S. Weber, Roosevelt University

A procedure for estimating covariances of marginally adjusted data in terms of first partial derivatives and covariances of the unscaled data and prescribed marginal sums is given. A numerical example demonstrates the dependence of these covariances upon the balancing procedure used to maintain consistency of sets of marginal sums in the presence of errors in the marginal sums.

KEY WORDS. Iterated Proportional fitting algorithm; IPFA; RAS procedure; IPS; Contingency table; Interaction matrix; Gravity model; Input-output model; Partial differentiation; Diagonally equivalent matrices.

1. Introduction

Categorical data may be presented as rectangular tables of rows and columns using two subscripts or as more general arrays with three or more subscripts. Applications of marginally adjusted categorical data include adjusted census data, migration modeling, updating of Leontief input-output coefficients, journey-to-work trip distribution modeling and certain budgeting allocation problems. (See Bacharach, 1970, Weber 1987, etc.)

In this paper we discuss the estimation of covariances of adjusted data in terms of covariances of initial entries and prescribed marginal sums. Two related issues are prominent. 1. The dependence of covariances (and derivatives) on the way that inconsistent marginal sums are made consistent; 2. The calculation of partial derivatives of scaled entries with respect to initial entries and marginal sums. The format of the paper is as follows. In section 2 we specifically describe row and column adjustments of tables of data. In section 3 we look at estimation of covariances of the marginally adjusted data. Then we contrast our approach with others. Also a more general framework is indicated. Finally appendices give a general least squares estimate of marginal sums and a sketch of a proof of convergence of our sequence of derivatives. While some relevant comments have appeared in the literature, few focus explicitly on the manner by which inconsistent constraints are revised to become or remain consistent. (See Weber, 1987). Consistency of marginal sums is central to understanding covariances of marginally adjusted tables or arrays. From time to time we point out details for computing these covariances.

2. Marginally Adjusted Tables of Data

We describe the basic calculating procedure for scaling tables of data to have prescribed row and column sums. $A$ is an $m \times n$ positive matrix of initial values. $M^{*1}$, $M^{*2}$ are vectors of $m$ row and $n$ column marginal sums. $B$ is an $m \times n$ scaled matrix with the prescribed marginal sums $M^{*1}$, $M^{*2}$. $D_1$, $D_2$ are diagonal scaling factors. To accommodate variation in marginal sums which at one moment we regard as free to vary in conformity with some covariance matrix but which in some way must also remain consistent, we write $M^{*1}$, $M^{*2}$ as functions of $M^1$, $M^2$ wherein $M^1$, $M^2$ vary freely and upon which $M^{*1}$ and $M^{*2}$ depend.

Equations (1abc)-(4ab) summarize our setup.

\[ (m_i^{*1}) = M^{*1} = T_1(M^1, M^2) > 0 \qquad (1a) \]
\[ (m_j^{*2}) = M^{*2} = T_2(M^1, M^2) > 0 \qquad (1b) \]
\[ \sum_{i=1}^{m} m_i^{*1} = \sum_{j=1}^{n} m_j^{*2} \qquad (1c) \]
\[ a_{ij} > 0 \qquad (2) \]
\[ b_{ij} = d_i^1 \, a_{ij} \, d_j^2 \qquad (3a) \]
\[ B = D_1 A D_2 \qquad (3b) \]
\[ \sum_{j=1}^{n} b_{ij} = m_i^{*1} \qquad (4a) \]
\[ \sum_{i=1}^{m} b_{ij} = m_j^{*2} \qquad (4b) \]

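It is well known that (1c), (3a), (4ab) uniquely determine the $b_{ij}$'s (Sinkhorn, 1967 and many others). We assume that $M^1, M^2 > 0$. Equations (1d), (1e), below, give instances of $T_1$, $T_2$ in (1ab). (See Weber, 1987, p. 626, (5), (6)).

\[ m_i^{*1} = m_i^1 + \Big( \sum_{j=1}^{n} m_j^2 - \sum_{i=1}^{m} m_i^1 \Big) \Big/ (m+n) \qquad (1d) \]

\[ m_j^{*2} = m_j^2 - \Big( \sum_{j=1}^{n} m_j^2 - \sum_{i=1}^{m} m_i^1 \Big) \Big/ (m+n) \qquad (1e) \]

Obviously $\sum_i m_i^{*1} = \sum_j m_j^{*2}$.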
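
As a quick check on the balancing idea, the following Python sketch spreads the discrepancy between the two grand totals equally over all m + n marginal sums. It assumes the additive form of (1d) and (1e) as read from the damaged source, so the formula itself should be treated as an assumption; the function name `balance_margins` is hypothetical.

```python
def balance_margins(m1, m2):
    """Shift row sums up and column sums down by a common amount.

    The shift is (sum(m2) - sum(m1)) / (m + n), so both adjusted sets of
    marginal sums have the same grand total. Positivity of the adjusted
    sums is not guaranteed and is not checked here.
    """
    m, n = len(m1), len(m2)
    shift = (sum(m2) - sum(m1)) / (m + n)
    return [x + shift for x in m1], [x - shift for x in m2]


rows_adj, cols_adj = balance_margins([10.0, 20.0], [12.0, 9.0, 15.0])
```

In this toy case the row sums total 30 and the column sums total 36, so each of the five marginal sums moves by 1.2 and both adjusted totals come to 32.4.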

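It is also well known that the $b_{ij}$'s may be calculated by iteratively scaling the initial and subsequent values. This procedure has many names, including "Iterated Proportional Fitting Algorithm" or "IPFA". Since this scaling procedure is used to estimate covariances, it is expressed below as (5) followed by the iteration of (6a) and (7a) (or (5) followed by the iteration of (6b) and (7b)).

\[ b_{ij}^{0} = a_{ij} \qquad (5) \]

\[ b_{ij}^{2r+1} = \Big( m_i^{*1} \Big/ \sum_{j=1}^{n} b_{ij}^{2r} \Big) \, b_{ij}^{2r}, \qquad r = 0, \ldots, \infty \qquad (6a) \]

\[ b_{ij}^{2r+1} = \Big( m_j^{*2} \Big/ \sum_{i=1}^{m} b_{ij}^{2r} \Big) \, b_{ij}^{2r}, \qquad r = 0, \ldots, \infty \qquad (6b) \]

\[ b_{ij}^{2r+2} = \Big( m_j^{*2} \Big/ \sum_{i=1}^{m} b_{ij}^{2r+1} \Big) \, b_{ij}^{2r+1}, \qquad r = 0, \ldots, \infty \qquad (7a) \]

\[ b_{ij}^{2r+2} = \Big( m_i^{*1} \Big/ \sum_{j=1}^{n} b_{ij}^{2r+1} \Big) \, b_{ij}^{2r+1}, \qquad r = 0, \ldots, \infty \qquad (7b) \]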

That is, (5) followed by iteration of (6a), (7a) (or (6b), (7b)) rapidly converges to a limiting matrix $B^{\infty}$, or simply $B$. We may write

li  =  «(.!,. 

=  rlr. 

Often only (5), (6a), (7a) (or (5), (6b), (7b)) are regarded as the well known scaling procedure for computing $B$ satisfying (1c), (3a), (4ab). However, for our purposes (1ab) must be regarded as an explicit and integral part of the scaling procedure. Hence our complete description of the marginal adjustment procedure is (1ab), (5), and iteration of (6a) and (7a) (or (6b) and (7b)).

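3. Covariances of Marginally Adjusted Tables of Data

We want to estimate the covariance matrix $\mathrm{COV}(B, M^{*1}, M^{*2})$ which may be partitioned as

\[ \begin{pmatrix} \mathrm{cov}(B,B) & \mathrm{cov}(B,M^{*1}) & \mathrm{cov}(B,M^{*2}) \\ \mathrm{cov}(M^{*1},B) & \mathrm{cov}(M^{*1},M^{*1}) & \mathrm{cov}(M^{*1},M^{*2}) \\ \mathrm{cov}(M^{*2},B) & \mathrm{cov}(M^{*2},M^{*1}) & \mathrm{cov}(M^{*2},M^{*2}) \end{pmatrix} \qquad (8b) \]

in terms of $\mathrm{COV}(A, M^1, M^2)$. This can be approximated as an instance of $\mathrm{COV}(F(X)) \cong (\partial F/\partial X)\, \mathrm{COV}(X)\, (\partial F/\partial X)^T$, namely

\[ \mathrm{COV}(B, M^{*1}, M^{*2}) \cong \left( \frac{\partial B M^{*1} M^{*2}}{\partial A M^1 M^2} \right) \mathrm{COV}(A, M^1, M^2) \left( \frac{\partial B M^{*1} M^{*2}}{\partial A M^1 M^2} \right)^T \qquad (9) \]

wherein $\partial(\cdot)/\partial(\cdot)$ denotes the appropriate Jacobian matrices of derivatives and $T$ denotes a matrix transpose. We next focus on computing these derivatives.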

3.2 Derivatives of Scaled Matrices

Some further discussion of notation is required. Expressions (5), (1a), (1b) can be combined to form a single vector function $(B^0, M^{*1}, M^{*2})$ of $(A, M^1, M^2)$. Obviously $B^0(A)$ simply maps $A$ to the $b_{ij}^0$ entries via a suitable identity map. $M^{*1}$ and $M^{*2}$ are computed in a single step but are always needed to iteratively scale the matrices $B^{2r}$, $B^{2r+1}$. The scalings can be combined with identity maps for $M^{*1}$, $M^{*2}$ to define a function $(B^{2r+1}, M^{*1}, M^{*2})$ of $(B^{2r}, M^{*1}, M^{*2})$. Doing this makes the augmented functions iterate vectors of the same dimension from the beginning to the end of the iteration process, allowing the calculus chain rule to apply in a simple way. Finally the augmented scaling (6a) or (7b) may be abbreviated by "R" for a row adjustment augmented by identity maps to carry along $M^{*1}$, $M^{*2}$. The augmented (7a) (or (6b)) may be abbreviated by "C" for column adjustment augmented by identity maps to carry along $M^{*1}$, $M^{*2}$. That is, $R(B^{2r}, M^{*1}, M^{*2}) = (B^{2r+1}, M^{*1}, M^{*2})$, etc. With these notational changes we may rewrite (9) as


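\[ \mathrm{COV}(B, M^{*1}, M^{*2}) \cong \left[ \cdots \frac{\partial C}{\partial B M^{*1} M^{*2}} \, \frac{\partial R}{\partial B M^{*1} M^{*2}} \cdots \frac{\partial B^0 M^{*1} M^{*2}}{\partial A M^1 M^2} \right] \mathrm{COV}(A, M^1, M^2) \left[ \cdots \frac{\partial C}{\partial B M^{*1} M^{*2}} \, \frac{\partial R}{\partial B M^{*1} M^{*2}} \cdots \frac{\partial B^0 M^{*1} M^{*2}}{\partial A M^1 M^2} \right]^T \qquad (10) \]

\[ = \Big( \prod_k \Gamma^k \Big) \, \mathrm{COV}(A, M^1, M^2) \, \Big( \prod_k \Gamma^k \Big)^T \]

for $\Gamma^k$ appropriately identified with the above matrices. Note that covariances may be calculated individually since

\[ \mathrm{cov}(b_{ij}, b_{pq}) = [\mathrm{COV}(B, M^{*1}, M^{*2})]_{(i-1)n+j,\,(p-1)n+q} = \sum_{s=1}^{mn+m+n} \sum_{t=1}^{mn+m+n} \gamma_{(i-1)n+j,\,s} \, [\mathrm{COV}(A, M^1, M^2)]_{s,t} \, \gamma_{(p-1)n+q,\,t} \qquad (11) \]

wherein $\gamma_{u,s}$ denotes an entry of the product $\prod_k \Gamma^k$ and $(i-1)n+j$, $(p-1)n+q$ change double indices to a single index.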

The preceding gives an overview of a computational procedure. We refer the reader to Weber (1987), Weber and Sen (1985), Weber and Sen (1983) and Weber (1981) for additional details. Here we summarize only key ideas of those papers as they relate to covariances of marginally adjusted data.

Obviously many individual partial derivatives are required to evaluate expressions (10) or (11). Let us now look more closely at these.

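We may write

\[ \frac{\partial R}{\partial B M^{*1} M^{*2}} = \frac{\partial (B^{2r+1}, M^{*1}, M^{*2})}{\partial (B^{2r}, M^{*1}, M^{*2})} \qquad (12a) \]

\[ = \begin{pmatrix} \dfrac{\partial B^{2r+1}}{\partial B^{2r}} & \dfrac{\partial B^{2r+1}}{\partial M^{*1}} & \dfrac{\partial B^{2r+1}}{\partial M^{*2}} \\ \dfrac{\partial M^{*1}}{\partial B^{2r}} & \dfrac{\partial M^{*1}}{\partial M^{*1}} & \dfrac{\partial M^{*1}}{\partial M^{*2}} \\ \dfrac{\partial M^{*2}}{\partial B^{2r}} & \dfrac{\partial M^{*2}}{\partial M^{*1}} & \dfrac{\partial M^{*2}}{\partial M^{*2}} \end{pmatrix} \qquad (12b) \]

\[ = \begin{pmatrix} \left( \dfrac{\partial b_{ij}^{2r+1}}{\partial b_{pq}^{2r}} \right) & \left( \dfrac{\partial b_{ij}^{2r+1}}{\partial m_p^{*1}} \right) & \left( \dfrac{\partial b_{ij}^{2r+1}}{\partial m_q^{*2}} \right) \\ 0 & I & 0 \\ 0 & 0 & I \end{pmatrix} \qquad (12c) \]
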

Obviously the format above bears a family resemblance to the covariance matrices (8ab), which (12abc) is used to estimate. To put formulae or values in (12c) we may differentiate (6a), (7a) and enter either formulae or values. (See Weber and Sen, 1985a). The entire matrices $\partial R/\partial B M^{*1} M^{*2}$, $\partial C/\partial B M^{*1} M^{*2}$ can be manipulated, or formulae for the entries can be coded as functions of indices, $B$, $A$, $M^{*1}$, and $M^{*2}$. Then covariances (as well as derivatives, if desired) may be computed all at once as a matrix using (10) or as subsets or individually via (11). Obviously there may be a wealth of applicable techniques for fast and efficient handling of large matrices.
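
The first-order propagation in (9) can be illustrated on a tiny case. The Python sketch below is not the paper's method: it replaces the analytic chain-rule derivatives with central finite differences through a plain row and column scaling loop, holds the marginal sums fixed, and assumes independent errors of variance 0.01 in the entries of a 2 x 2 matrix A. All function names and numbers are hypothetical.

```python
def ipf(a, row_sums, col_sums, sweeps=500):
    """Alternate row and column scalings for a fixed number of sweeps."""
    b = [row[:] for row in a]
    for _ in range(sweeps):
        for i, row in enumerate(b):                 # row adjustment
            s = sum(row)
            b[i] = [x * row_sums[i] / s for x in row]
        for j, target in enumerate(col_sums):       # column adjustment
            s = sum(row[j] for row in b)
            for row in b:
                row[j] *= target / s
    return b


def jacobian_wrt_a(a, row_sums, col_sums, h=1e-6):
    """Numerical Jacobian of the flattened B with respect to the flattened A."""
    m, n = len(a), len(a[0])
    jac = [[0.0] * (m * n) for _ in range(m * n)]
    for k in range(m * n):
        i, j = divmod(k, n)
        up = [row[:] for row in a]
        dn = [row[:] for row in a]
        up[i][j] += h
        dn[i][j] -= h
        b_up = ipf(up, row_sums, col_sums)
        b_dn = ipf(dn, row_sums, col_sums)
        for q in range(m * n):
            p, c = divmod(q, n)
            jac[q][k] = (b_up[p][c] - b_dn[p][c]) / (2 * h)
    return jac


def propagate(jac, cov_x):
    """First-order covariance: J * COV(X) * J^T, with plain nested lists."""
    rows, cols = len(jac), len(cov_x)
    tmp = [[sum(jac[r][k] * cov_x[k][c] for k in range(cols))
            for c in range(cols)] for r in range(rows)]
    return [[sum(tmp[r][k] * jac[c][k] for k in range(cols))
             for c in range(rows)] for r in range(rows)]


a = [[1.0, 2.0], [3.0, 4.0]]
row_sums, col_sums = [3.0, 7.0], [4.0, 6.0]
cov_a = [[0.01 if r == c else 0.0 for c in range(4)] for r in range(4)]
cov_b = propagate(jacobian_wrt_a(a, row_sums, col_sums), cov_a)
```

Because the row sums of B are held fixed in this sketch, the estimated variance of b11 + b12 should vanish, which gives a simple sanity check on the propagated matrix.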


3.3  A  Numerical  Example 

The following discussion of a 5 x 5 example shows that linearly
estimated covariances depend on the procedure by which inconsistencies
are resolved. Weber (1987) gives a numerical example showing
the dependence of constrained gravity model derivatives on the
balancing procedure. Mathematically, constrained gravity models are
identical to the row and column marginally adjusted tables of data
that we have been discussing.

Table 1 gives hypothetical data for $A$, $M^1$, $M^2$ and $B$. This
example is borrowed from Weber (1987; 1981) and is reasonable
in the context of those discussions. Here all that matters is that
$A, M^1, M^2, B > 0$, which is obviously true.

Table 1

Iterated Proportional Fitting Example

A: Initial Interaction Matrix

    0.7828   0.6128   0.4512   0.4331   0.3679
    0.6128   0.7363   0.6128   0.5882   0.4996
    0.4512   0.6128   0.7515   0.5764   0.4512
    0.4331   0.5882   0.5764   0.7214   0.5099
    0.3679   0.4996   0.4512   0.5099   0.7068

M^1: Row Sums

    m^1_1     m^1_2     m^1_3     m^1_4     m^1_5
    6500.00   8000.00   5280.00    850.00   3700.00

M^2: Column Sums

    m^2_1     m^2_2     m^2_3     m^2_4     m^2_5
    6660.00   8030.00   5270.00    840.00   3530.00

B: Marginally Adjusted Values

    2431.37   2078.87   1108.14    180.44    701.18
    2143.40   2813.23   1694.93    275.98   1072.47
    1151.43   1708.13   1516.59    197.29    706.57
     189.66    281.36    199.58     42.37    137.03
     744.14   1148.42    750.76    143.92    912.76

$M^{*1} = M^1$ and $M^{*2} = M^2$ since
$\sum_{i=1}^{m} m^1_i = \sum_{j=1}^{n} m^2_j$.
In $B$, the row sums are given by $M^1$, the column sums are given by
$M^2$, and $B = D_1 A D_2$ where $D_1$ and $D_2$ are diagonal matrices.
$B$ was obtained by the Iterated Proportional Fitting Algorithm.

Table 2, below, gives the partial derivatives
$\partial b_{11}/\partial m^1_1, \ldots, \partial b_{11}/\partial m^2_5$
for three different balancing procedures: A, B
(BL and BR) and C. The procedures A, BL, BR and C are defined
immediately after Table 3b. See Weber (1987) for a more lengthy
discussion.

Table  2 


Example 2. COV(A) = 0, COV($M^1 M^2$) = 10 x 10 diagonal matrix
of (0,0,0,1,1,0,1,0,1,0). That is,
$\sigma^2_{m^1_4} = \sigma^2_{m^1_5} = \sigma^2_{m^2_2} = \sigma^2_{m^2_4} = 1$
and the others are zero.

Table 3b

    Balancing
    Procedure    $\sigma^2_{b_{11}}$

    A            0.00528653
    BL           0.01693739
    BR           0.01364130
    C            0.01628051

Largest $\sigma^2_{b_{11}}$ / smallest $\sigma^2_{b_{11}}$ = 3.20

For both examples, the balancing procedures are defined below.


$\partial b_{11}/\partial(M^1, M^2)$

         m^1_1   m^1_2    m^1_3   m^1_4   m^1_5   m^2_1   m^2_2   m^2_3   m^2_4   m^2_5

    A    .292    -.0482   -.0330  -.0346  -.0303  .283    -.0462  -.0308  -.0322  -.0276
    BL   .343     .00248   .0177   .0161   .0204  .232    -.0969  -.0815  -.0829  -.0783
    BR   .243    -.0925   -.0833  -.0838  -.0795  .332     .0031   .0258   .0170   .0216
    C    .323    -.0179   -.0027  -.0043   0      .232    -.0969  -.0815  -.0829  -.0783

A. $M^{*1}$ and $M^{*2}$ are least squares estimates such that
$M^{*1}, M^{*2}$ are consistent and the sum of the squared deviations
from $M^1$ and $M^2$ is as small as possible. The formula is given by
(1d), (1e).

B. BL, BR correspond to the set of marginal constraints for which
the constraints which sum to the larger value are scaled down to
sum to the smaller sum. BL and BR are right and left derivatives,
and probably the variances associated with BR and BL
should be averaged.


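We use these derivatives to estimate covariances for hypothetical
covariances of $A, M^1, M^2$. In the two examples which follow, we let
COV(A) = 0. (The effect of including a diagonal covariance matrix
COV(A) simply would be to add a constant to the numerical values
we obtain for $\sigma^2_{b_{11}}$ below.)

    $\sigma(b_{ij}, b_{pq}) \approx (\partial b_{ij}/\partial(A, M^1, M^2))\,\mathrm{COV}(A, M^1, M^2)\,(\partial b_{pq}/\partial(A, M^1, M^2))'$   (13)

    $\qquad = (\partial b_{ij}/\partial(M^1, M^2))\,\mathrm{COV}(M^1, M^2)\,(\partial b_{pq}/\partial(M^1, M^2))'$

since COV(A) = 0.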


C. Here, a single marginal sum absorbs all of the inconsistencies
arising from variability of the others.


In this subsection we established our claim that covariances
depend on the balancing procedure. Obviously we would like to be able
to report having done a simulation as a check on the examples
reported here, but this has not been done. (Note however that Weber
and Sen, 1985a, compute all of the covariances for a different
numerical example by both a linearization and a simulation, obtaining
close agreement.)


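Example 1. COV(A) = 0, COV($M^1 M^2$) = $\sigma^2 I_{10}$ = a variance
times the 10 x 10 identity matrix. Then

    $\sigma^2_{b_{11}} = \sigma^2\,\|\partial b_{11}/\partial(M^1, M^2)\|^2$

wherein $\|\cdot\|$ is ordinary Euclidean length.

Table 3a

    Balancing
    Procedure    $\sigma^2_{b_{11}}$

    A            $\sigma^2 \times$ 0.17576217
    BL           $\sigma^2 \times$ 0.20150297
    BR           $\sigma^2 \times$ 0.19954164
    C            $\sigma^2 \times$ 0.18753435

Largest $\sigma^2_{b_{11}}$ / smallest $\sigma^2_{b_{11}}$ = 1.15. In
some situations, $\sigma^2$ = 4872 would be reasonable.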
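
The entries of Tables 3a and 3b are quadratic forms in the Table 2 derivatives, so they can be recomputed in a few lines. The sketch below is illustrative only: the dictionary `D`, the function `variance` and the reading of Table 2 (decimal points restored, ten derivatives per procedure in the order $m^1_1, \ldots, m^1_5, m^2_1, \ldots, m^2_5$, and a negative sign on the third BR entry) are assumptions of this note. With a diagonal covariance matrix the form reduces to a weighted sum of squared derivatives.

```python
# Table 2 derivatives of b_11, as read for this sketch.
D = {
    "A":  [.292, -.0482, -.0330, -.0346, -.0303, .283, -.0462, -.0308, -.0322, -.0276],
    "BL": [.343, .00248, .0177, .0161, .0204, .232, -.0969, -.0815, -.0829, -.0783],
    "BR": [.243, -.0925, -.0833, -.0838, -.0795, .332, .0031, .0258, .0170, .0216],
    "C":  [.323, -.0179, -.0027, -.0043, 0.0, .232, -.0969, -.0815, -.0829, -.0783],
}


def variance(derivs, cov_diag):
    """d' COV d for a diagonal COV given by its diagonal entries."""
    return sum(c * d * d for d, c in zip(derivs, cov_diag))


example1 = [1.0] * 10                       # sigma^2 = 1 times the identity
example2 = [0, 0, 0, 1, 1, 0, 1, 0, 1, 0]   # unit variances on four sums only

for name, derivs in D.items():
    print(name, round(variance(derivs, example1), 8),
          round(variance(derivs, example2), 8))
```

The first printed column corresponds to the multipliers of $\sigma^2$ in Table 3a and the second to Table 3b.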


3.4 Computational Alternatives

These covariances may be estimated by using simulations; however,
there are a number of disadvantages with simulations.

1. Changes in the $b_{ij}$'s are highly correlated and therefore a large
number of simulated points will be necessary. (See Weber and
Sen, 1985a, for discussion of an empirical stopping rule which
can be used in this situation.)

2. All of the $b_{ij}$'s need to be calculated for each set of simulated
values for $A, M^1, M^2$. In contrast, linearly approximated
covariances can be programmed to provide covariances individually, if
desired.

3. It is difficult to simulate random vectors with other than a
diagonal covariance matrix. This could be a problem for a simulation
but not for a linear approximation.

However, for contingency tables which are not too big, when
there is access to adequate computing resources, we prefer using
both methods over either one by itself. Note that when a simulation
is done, explicit attention to the balancing of inconsistent row
and column sums is required.

Another alternative arises in using an implicit function approach
to obtain the partial derivatives that we need. Bacharach (1970)
does this. This leads to a generalized inverse of a singular matrix.
The choice of a particular g-inverse should be linked to behavioral
circumstances. Weber (1987) discusses three balancing procedures in
real world settings.
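
A cheap numerical cross-check that sits between a full simulation and the linearization is a finite-difference derivative. The sketch below is an illustration under stated assumptions: it perturbs $m^1_1$, restores consistency with the two-set least squares rule (procedure A, with the equal shift $(\sum m^1 - \sum m^2)/(m+n)$), rescales with iterated proportional fitting, and differences $b_{11}$. The function names, the step size `h` and the marginal sums 3700.00 and 6660.00 are choices made for this note, not taken from the paper.

```python
def ipf(a, row_sums, col_sums, sweeps=200):
    """Iterated proportional fitting by alternating row/column scaling."""
    b = [row[:] for row in a]
    for _ in range(sweeps):
        for i, target in enumerate(row_sums):
            factor = target / sum(b[i])
            b[i] = [v * factor for v in b[i]]
        for j, target in enumerate(col_sums):
            factor = target / sum(row[j] for row in b)
            for row in b:
                row[j] *= factor
    return b


def balance_least_squares(m1, m2):
    """Procedure A for two sets: shift every sum by the same amount."""
    shift = (sum(m1) - sum(m2)) / (len(m1) + len(m2))
    return [v - shift for v in m1], [v + shift for v in m2]


A = [
    [0.7828, 0.6128, 0.4512, 0.4331, 0.3679],
    [0.6128, 0.7363, 0.6128, 0.5882, 0.4996],
    [0.4512, 0.6128, 0.7515, 0.5764, 0.4512],
    [0.4331, 0.5882, 0.5764, 0.7214, 0.5099],
    [0.3679, 0.4996, 0.4512, 0.5099, 0.7068],
]
M1 = [6500.0, 8000.0, 5280.0, 850.0, 3700.0]
M2 = [6660.0, 8030.0, 5270.0, 840.0, 3530.0]


def b11(m1, m2):
    """b_11 after balancing the marginal sums and rescaling A."""
    rows, cols = balance_least_squares(m1, m2)
    return ipf(A, rows, cols)[0][0]


h = 1.0  # perturbation of m^1_1, small relative to sums in the thousands
up, down = M1[:], M1[:]
up[0] += h
down[0] -= h
derivative = (b11(up, M2) - b11(down, M2)) / (2 * h)
print(f"d b11 / d m^1_1 under procedure A: {derivative:.4f}")
```

The printed value can be set against the first procedure A entry of Table 2. Repeating the loop over all ten marginal sums gives a full row of derivatives, which is exactly where the choice of balancing rule enters a simulation as well.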

4. More General Situations and Conclusion

Obviously, the linearization of scaling procedures can be done
over 3 or more subscripts. Then we would have 3 or more sets of
marginal sums. The literature seems to be less extensive
on existence results and algorithms for arrays satisfying a
certain functional relationship to an initial set of values and
having prescribed marginal sums. In debating between linearizations,
simulations, balancing procedures and generalized inverses, the same
problems apparently remain. If the initial values are nonnegative
rather than positive, then the picture is complicated somewhat. The
interested reader might consult Fienberg (1983b, 1970) or Berman
and Plemmons (1979).

To conclude, we emphasize again that the error propagation from
marginal sums to scaled values depends on the way that inconsistent
marginal constraints are brought back into consistency. Error
propagation and sensitivity can be looked at via covariance matrices,
as we have done here, or via elasticities or derivatives, as is done in
Weber (1987).

Appendices discuss the least squares consistent marginal sums for
arbitrary sets of marginal sums as well as the convergence of
derivatives of scaled matrices. It would be nice to have general
theorems or rules of thumb giving bounds on the variability and
dependence of covariances of marginally adjusted data upon the way that
marginal sums are kept consistent. This might be a worthwhile future
effort. Finally, implementing (11) using the parse trees as suggested
by Sawyer (1984) may be especially efficient.

Endnotes

1. The iteration of (6a) (7a) or (6b) (7b) has many names, including
the Iterative Proportional Fitting Algorithm or "IPFA" (Fienberg
and Meyer, 1983a); the Iterative Scaling Procedure or "ISP"
(Plackett, 1974, p. 32); the Deming-Stephan-Furness Procedure or "DSF
Procedure" after three early users of the procedure (see Sen and
Smith); the Furness Procedure; the RAS Procedure after the
expression giving the functional form of diagonal equivalence; and
"painting a matrix" to have prescribed row and column sums (George W.
Soules). This list probably is not complete.

2. Our treatment of $\partial B/\partial A$ is slighted here since it
does not depend on the choice of $(F_1, F_2)$, provided that
$(M^{*1}, M^{*2})$ are the same at some evaluation point
$(A, M^1, M^2)$ for different $(F_1, F_2)$'s.
When $(M^{*1}, M^{*2})$ is fixed for some value of $(M^1, M^2)$ and $A$
remains fixed, then $\partial B/\partial A$ is not influenced by
different choices of $(F_1, F_2)$.

See  Weber  (1981),  Chapter  6  for  full  details  of  dB^ /dA,  dB^^ [dA, 
etc. 
Appendix 2

Sketch  of  Proof  of  Convergence  of  Derivatives 
The  reader  may  have  questioned  the  convergence  of 
[dC/dBM’'M''‘)(aR/dBM’'M’'‘)]''dB°M" 

as n becomes large as well as the relation of this expression to the
derivative of the limit of row and column scalings. This question is
explored at length in Weber (1981), Chapter 6 and in lesser detail in
Weber and Sen (1985ab).

In fact, the sequence of matrix powers that appears in (10) does
converge and the limit is the desired Jacobian matrix of derivatives.
To prove this we use a basic calculus result on commuting of limits
which simply concludes that


_ "Some New Forms of Spatial Interaction Models: A Review,"
Transportation Research, 9 (1975), 167-179.

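Appendix 1

Least Squares Consistent Marginal Sums

Given a set of vectors, $M^{\nu}$, $\nu = 1, \ldots, V$, of marginal
sums, $M^{\nu} = (m^{\nu}_i,\ i = 1, \ldots, I_{\nu})$, LaGrange
multipliers may be used to obtain revised marginal sums $M^{*\nu}$
such that

    $\sum_{i=1}^{I_\nu} m^{*\nu}_i = \sum_{i=1}^{I_V} m^{*V}_i$  for $\nu = 1, \ldots, (V-1)$   (A1)

    $\sum_{\nu=1}^{V} \sum_{i=1}^{I_\nu} (m^{*\nu}_i - m^{\nu}_i)^2$  is minimized   (A2)

The LaGrangian

    $\mathcal{L}(M^*, \lambda) = \sum_{\nu=1}^{V} \sum_{i=1}^{I_\nu} (m^{*\nu}_i - m^{\nu}_i)^2 + \sum_{\nu=1}^{V-1} \lambda_\nu \Bigl( \sum_{i=1}^{I_V} m^{*V}_i - \sum_{i=1}^{I_\nu} m^{*\nu}_i \Bigr)$   (A3)

leads to

    $\partial\mathcal{L}/\partial m^{*\nu}_i = 2(m^{*\nu}_i - m^{\nu}_i) - \lambda_\nu = 0$  for $\nu = 1, \ldots, (V-1)$, for $i = 1, \ldots, I_\nu$   (A4)

    $\partial\mathcal{L}/\partial m^{*V}_i = 2(m^{*V}_i - m^{V}_i) + \sum_{\nu=1}^{V-1} \lambda_\nu = 0$  for $i = 1, \ldots, I_V$   (A5)

    $\partial\mathcal{L}/\partial \lambda_\nu = \sum_{i=1}^{I_V} m^{*V}_i - \sum_{i=1}^{I_\nu} m^{*\nu}_i = 0$  for $\nu = 1, \ldots, (V-1)$   (A6)

Define $m^{\nu}_{+} = \sum_{i=1}^{I_\nu} m^{\nu}_i$ for
$\nu = 1, \ldots, V$. Then (A1) - (A6) lead to

    $(\lambda_\nu) = \dfrac{2}{I_V}
      \begin{bmatrix}
        (I_1/I_V + 1) & 1 & \cdots & 1 \\
        1 & (I_2/I_V + 1) & \cdots & 1 \\
        \vdots & & \ddots & \vdots \\
        1 & 1 & \cdots & (I_{V-1}/I_V + 1)
      \end{bmatrix}^{-1}
      \begin{bmatrix}
        m^{V}_{+} - m^{1}_{+} \\
        m^{V}_{+} - m^{2}_{+} \\
        \vdots \\
        m^{V}_{+} - m^{V-1}_{+}
      \end{bmatrix}$   (A7)

wherein $(\lambda_\nu)$ denotes the vector of $(V-1)$ LaGrange
multipliers.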


    $\lim_{x \to u} \lim_{y \to v} f(x,y) = \lim_{y \to v} \lim_{x \to u} f(x,y) = \lim_{(x,y) \to (u,v)} f(x,y)$

provided (i) $\lim_{y \to v} f(x,y)$ exists for each x in a relevant set
and (ii) $\lim_{x \to u} f(x,y)$ exists uniformly for the set of y
values (e.g., see Lang, 1968, p. 134, Theorem 6). This result can
apply here by showing (i) that every finite sequence of iterations is
differentiable and (ii) that sequences of Newton quotients converge
uniformly for some neighborhood of the independent variables. The
first requirement is easily met. The matrices and vectors
$A, M^{*1}, M^{*2}$ are positive and the iterative scalings are all
trapped in a compact positive region of $R^{mn+m+n}$. Since
$A, M^{*1}, M^{*2}$ remain strictly positive, each scaling is
differentiable.

The second requirement seems to require more lengthy discussion.
In a compact, sufficiently small ball about $(B^0, M^{*1}, M^{*2})$, the
Newton quotients are a uniformly Cauchy sequence. That is, for all
$\epsilon > 0$, there is an integer, $N(\epsilon)$, such that whenever
$k_1, k_2$ are greater than $N(\epsilon)$, then the Newton quotients for
$k_1$ and $k_2$ iterations of (RC) differ by less than $\epsilon$.

The above facts are precisely the requirements of the theorem.
Consequently the limit of a sequence of Newton quotients for IPFA
as an increment to a marginal sum becomes arbitrarily small and the
limit of using arbitrarily many compositions of derivatives of IPFA
scalings commute. In particular, we may let the increments go to
zero and proceed to a derivative of the limit of iterative scalings via
derivatives of finitely many scalings. Also, the convergence is rapid,
hence a 3rd or 4th power is shown in (10).

A cautionary note must be sounded regarding the convergence
of compositions of derivatives of iterative scalings. Sinkhorn (1967)
and, independently, Weber (1980) showed that

    $L^1 = \lim_{k \to \infty} (RC)^k(A) = \Bigl( \sum_{i=1}^{m} m^1_i \Big/ \sum_{j=1}^{n} m^2_j \Bigr) \lim_{k \to \infty} (CR)^k(A)$   (A10)

where R and C denote row and column scalings and exponents denote
repeated application of the function inside the parentheses, etc.
Now differentiating an (i, j) entry on each side of (A10) w.r.t.
$m^1_i$ yields


'^"•1  V  11%  1  /  ^'"'1 


(All) 


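Solving (A4) and (A5) for the revised sums leads to

    $m^{*\nu}_i = m^{\nu}_i + \lambda_\nu/2$  for $\nu = 1, \ldots, V-1$, for $i = 1, \ldots, I_\nu$   (A8)

    $m^{*V}_i = m^{V}_i - \sum_{\nu=1}^{V-1} \lambda_\nu/2$  for $i = 1, \ldots, I_V$   (A9)

Setting $V = 2$, $I_1 = m$, $I_2 = n$ in (A7), (A8) and (A9) leads to
(1d), (1e) presented in the body of the paper.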
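
The least squares problem (A1)-(A2) can also be solved without forming the multiplier system. The sketch below relies on a reformulation that is this note's own, not the appendix's: since every entry of one set receives the same shift, it is enough to choose the common total $T$ that minimizes $\sum_\nu (T - m^\nu_+)^2 / I_\nu$, which gives $T = \sum_\nu (m^\nu_+/I_\nu) \big/ \sum_\nu (1/I_\nu)$. The function name and the small data set are hypothetical.

```python
def least_squares_consistent(marginals):
    """Shift each set of marginal sums equally so that all sets share one total.

    marginals: list of lists, one list per set of marginal sums.
    Returns revised sets minimizing the total squared change.
    """
    totals = [sum(m) for m in marginals]
    weights = [1.0 / len(m) for m in marginals]
    # Common total: weighted mean of the set totals, weights 1 / (set size).
    common = sum(t * w for t, w in zip(totals, weights)) / sum(weights)
    return [
        [v + (common - t) / len(m) for v in m]
        for m, t in zip(marginals, totals)
    ]


# Hypothetical inconsistent row sums (total 60) and column sums (total 70).
rows, cols = least_squares_consistent([[10.0, 20.0, 30.0], [25.0, 45.0]])
print(rows, cols)
```

For two sets this reduces to adding $(\sum m^2 - \sum m^1)/(m+n)$ to every row sum and subtracting it from every column sum, which is the $V = 2$ case referred to above.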


>  0. 


(A12) 


(hn\  f)mj 

Thus any computer implementation of linearizations of scalings should
be done with full awareness that certain intermediate calculations do
not converge. However, we observe points of accumulation when
there are sets of marginal constraints, for example, a set of row sums
and a set of column sums. A full explanation is found in Weber (1981,
1987).

Optimizing Linear Functions of Random Variables
Having a Joint Multinomial or Multivariate Normal Distribution


J. P. De Los Reyes, University of Akron

Introduction. Let $\nu_1, \ldots, \nu_r$ have a joint multinomial
distribution with parameters $n, p_1, \ldots, p_r$
($\sum \nu_i = n$, $\sum p_i = 1$, $p_i > 0$). The standardized
variables

    $\chi_i = (\nu_i - n p_i)/\sqrt{n p_i (1 - p_i)}$,  $i = 1, \ldots, r$,

have a limiting joint normal distribution with means
zero, variances one, and correlation matrix $\Gamma$ of rank
r-1:

(1)  $\Gamma = D - pp'$,  $D = \mathrm{diag}[1/(1-p_1), \ldots, 1/(1-p_r)]$,
     $p' = [\sqrt{p_1/(1-p_1)}, \ldots, \sqrt{p_r/(1-p_r)}]$.

If $p_i = 1/r$ for $i = 1, \ldots, r$, then $\Gamma$ is equicorrelated
with common correlation $\rho = 1/(1-r)$.
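
The structure of the correlation matrix in (1) is easy to check numerically. The sketch below is illustrative; the function name and the probability vector are arbitrary choices of this note. It builds $D - pp'$, and then verifies three properties: unit diagonal, the null vector with entries $\sqrt{p_i(1-p_i)}$ that makes the matrix singular, and the common correlation $1/(1-r)$ in the equiprobable case.

```python
from math import sqrt


def multinomial_correlation(p):
    """Correlation matrix D - pp' of the standardized multinomial counts."""
    r = len(p)
    q = [sqrt(pi / (1.0 - pi)) for pi in p]
    return [
        [(1.0 / (1.0 - p[i]) if i == j else 0.0) - q[i] * q[j] for j in range(r)]
        for i in range(r)
    ]


p = [0.2, 0.3, 0.5]
G = multinomial_correlation(p)

# Singularity: G w = 0 for w_i = sqrt(p_i (1 - p_i)), so the rank is at most r - 1.
w = [sqrt(pi * (1.0 - pi)) for pi in p]
residual = [sum(G[i][j] * w[j] for j in range(len(p))) for i in range(len(p))]

# Equiprobable case with r = 4: every off-diagonal entry equals 1 / (1 - r).
E = multinomial_correlation([0.25] * 4)
print(G[0][0], residual, E[0][1])
```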

Suppose numbers $s_1, \ldots, s_r$ are sought that minimize
$G(s)$ subject to the probabilistic constraint
$\Psi_r(s) \ge 1-\alpha$ ($0 < \alpha < 1$). By normal approximation,
numbers $x_1, \ldots, x_r$ are then required which minimize $F(x)$
subject to $\Phi_r(x) \ge 1-\alpha$, where we define for constants
$a_i \ge 0$,

(2)  $G(s) = \sum_{i=1}^{r} a_i s_i$,   $F(x) = \sum_{i=1}^{r} a_i x_i \sqrt{p_i (1-p_i)}$,
     $\Psi_r(s) = P\{\nu_i \le s_i,\ i = 1, \ldots, r\}$,
     $\Phi_r(x) = P\{\chi_i \le x_i,\ i = 1, \ldots, r\}$, and
     $\Phi_r(x, \rho) = P\{\chi_i \le x,\ i = 1, \ldots, r\}$

in the symmetric case, namely, if $\Gamma$ is equicorrelated
and $x = (x, x, \ldots, x)$, i.e., x also is equicoordinate.

If r = 2 and $a_1 = a_2 = 1$, then binomial probability
vectors $(s_1, s_2)$ may be obtained from tables of the
cumulative binomial probability distribution [Harvard
Univ. Computing Laboratory 1955] by choosing $s_1$ and
$s_2$ so that the tail probabilities on either side of the
binomial distribution of $\nu_1$ are each equal to $\alpha/2$, in
effect centering the probability mass $1-\alpha$ (see also
Example 1). The direct evaluation of the multinomial
sums $\Psi_r(s)$ in (2) involves considerable difficulties if
$r \ge 3$, while the corresponding normal probability
integral $\Phi_r(x)$ may be evaluated by numerical
integration [Milton 1972]. Since vectors s that
minimize $G(s)$ can be obtained from those that
minimize $F(x)$ using the formula

    $s_i = n p_i + x_i \sqrt{n p_i (1-p_i)}$,

we then focus attention on solving the normal case.

For r > 2, let us define a multivariate analogue of
the upper $\alpha$ probability point $y_\alpha$ of a distribution
($y_\alpha$ is the least value such that
$P\{\xi \le y_\alpha\} \ge 1-\alpha$) to be
any vector y for which
$P\{\xi_i \le y_i\ (i = 1, \ldots, n)\} \ge 1-\alpha$.

These upper $\alpha$ probability vectors are generally
nonunique since distinct vectors y can yield the
same probabilities. However, the vectors x and s
named above are unique for the specified singular
normal distribution and the multinomial distribution,
respectively, since they optimize the linear functions
F and G of random variables.

2. The optimization problem. In general, probability
vectors may be found as follows: Minimize
$c_1 x_1 + \ldots + c_r x_r$ ($c_i \ge 0$, $i = 1, \ldots, r$) subject to
(a) equality constraints, if any: $F_i(x) = A_i$ ($i = 1, \ldots, m < r$),
and (b) inequality constraints, if any: $H_i(x) \ge B_i$
($i = 1, \ldots, n$), where $x_1, \ldots, x_r$ are values taken by r
random variables having a joint distribution, and at least one
of the constraints involves a probability distribution
of the random variables.

The nonlinear programming problem to find optimal
upper $\alpha$ probability vectors x is the following:

(3)  Minimize $F(x) = \sum_{i=1}^{r} a_i x_i \sqrt{p_i (1-p_i)}$ subject to:

     (i) $\Phi_r(x) \ge 1-\alpha$ and (ii) $\sum_{i=1}^{r} x_i \sqrt{p_i (1-p_i)} \ge 0$,

where (ii) is included in order that the probability
function in (i) is nonnegative.

The solution is two-fold: first is the evaluation of
$\Phi_r(x)$; second is the optimization part. The
optimization routine LPNLP [Pierre & Lowe 1975] and
the multidimensional quadrature subroutine MVNORM
[Milton 1972, Bohrer & Shervish 1981] provide a
complete numerical solution for $1 \le r-1 \le 10$. However,
since LPNLP must evaluate the probability integral
at numerous points x to find the optimal
probability vector, a good approximation to $\Phi_r(x)$ and
other related "computer-ready" formulas that will
ease the computation of $\Phi_r(x)$ are valuable.
Example 1. To illustrate the bivariate case with
$a_1 = a_2 = 1$, consider the following "ice cream problem"
(first posed by L. Takacs at Case Western Reserve
University, 1979): At a banquet the dinner menu lists
ice creams of two flavors. Independently of the
others, each of the 1000 guests may order an ice
cream of one of the two flavors with probability 1/2.
What is the smallest number of ice creams of each
flavor that must be provided to insure that each
guest gets his or her choice with probability $\ge .999$?
A solution using normal approximation is given by:

Theorem 1. Let $\nu_1$ have a binomial distribution with
parameters n and $p_1$. Let $\nu_2 = n - \nu_1$, $p_2 = 1 - p_1$. Then the
numbers $s_1, s_2$ for which $s_1 + s_2$ is a minimum and such
that (i) $s_1 + s_2 \ge n$ and (ii) $\Psi_2(s) \ge 1-\alpha$ both hold, are
given by $s_i = n p_i + x_{\alpha/2} \sqrt{n p_1 p_2}$, where
$x_{\alpha/2} = \Phi^{-1}(1 - \alpha/2)$,
the upper $\alpha/2$ probability point of the standard
normal distribution.

Proof: By normal approximation and Lagrange
multipliers, $x_1 = x_2$, and the conclusion follows. □

The answer to the "ice cream problem" is thus
$s_1 = s_2 = 1000(.5) + 3.291[1000(.5)(.5)]^{1/2}$ or about 552 ice
creams each (agrees with values obtained using
binomial tables), a little more than the expected
demand of 500 each but much less than the maximum
possible demand of 1000 ice creams of each flavor. In
this single-period inventory model [Hillier &
Lieberman] the additional 104 ice creams are to
ward off shortages that may arise, with probability
at most .001, from a variability in demand rather
than from any delays in delivery or lead time demand
since no reorders are made.
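
The arithmetic of the ice cream problem, and the remark that it agrees with binomial tables, can be reproduced directly. The following is a minimal sketch: the function names are hypothetical, the exact search is written only for the equal-probability two-flavor case, and exact integer arithmetic stands in for the printed tables.

```python
from fractions import Fraction
from math import comb, sqrt
from statistics import NormalDist


def normal_supply(n, p, alpha):
    """Theorem 1: s = n p + x_{alpha/2} sqrt(n p (1 - p))."""
    x = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    return n * p + x * sqrt(n * p * (1.0 - p))


def exact_supply(n, alpha):
    """Smallest s with P(n - s <= X <= s) >= 1 - alpha, X ~ Binomial(n, 1/2).

    alpha must be a Fraction so that the comparison is exact.
    """
    total = 2 ** n
    for s in range((n + 1) // 2, n + 1):
        mass = sum(comb(n, k) for k in range(n - s, s + 1))
        if Fraction(mass, total) >= 1 - alpha:
            return s
    return n


approx = normal_supply(1000, 0.5, 0.001)
exact = exact_supply(1000, Fraction(1, 1000))
print(f"normal approximation: {approx:.2f}, exact binomial: {exact}")
```

Stocking `s` of each flavor guarantees no shortage exactly when the demand for the first flavor lies between `n - s` and `s`, which is the event whose probability the exact search accumulates.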

In general, for r = 2 in (3), the optimal pair $(x_1, x_2)$
may be found by using Lagrange multipliers,
expressing $x_2$ as a function of $x_1$, then applying a
one-dimensional search procedure, such as the
bisection method [McCormick and Salvadori 1964].

Under a specified multinomial demand in a single-period
model, one might consider multi-type products
of one kind such as ice creams of r different
flavors, concert T-shirts of various colors and sizes,
spare parts of a discontinued multicomponent system,
airline or train seats to different cities at a given
time (compare and contrast with Feller's railroad
train seats example [Feller 1968, p. 188]), main dishes
at an airline's in-flight meal, dated snack items at a
coin-operated vending machine, or blood supply at a
local blood bank. Such items either become obsolete
quickly, spoil easily, are stocked up only once, or
have a future that is uncertain beyond a single
period. In each of these cases, an upper $\alpha$
probability vector s would give the smallest supply
$s_i$ of an item of type i such that the probability of
no shortage is at least $1-\alpha$.

The next three sections deal with various
formulations of the integral $\Phi_r(x)$ in an effort to
find simple computer-ready formulas to be used in
the anticipated numerical integration: section 3
presents $\Phi_r(x)$ as a single iterated integral over the
given simplicial domain of integration; section 4
shows that $\Phi_r(x)$ is a sum of its lower-dimensional
marginal distribution integrals; while section 5
expresses $\Phi_r(x)$ as a sum of iterated integrals of
the uncorrelated normal distribution over certain
simplices that yield nice limits of integration.

3. Integration over a simplex. An m-dimensional
simplex is defined to be the intersection in m-dimensional
space of m+1 half-spaces such that any
m of the bounding hyperplanes of the half-spaces
meet in exactly one point, a vertex of the simplex.
Also, any m+1 points which do not lie in an
(m-1)-dimensional space are the vertices of an m-dimensional
simplex whose elements are also simplices
formed by subsets of the m+1 points, namely, the
vertices themselves, the $\binom{m+1}{2}$ edges, the $\binom{m+1}{3}$
triangles, the $\binom{m+1}{4}$ tetrahedrons, ..., in general the
$\binom{m+1}{k+1}$ cells bounding the simplex which are k-simplices,
and finally, the m+1 bounding cells or faces which
are (m-1)-simplices [Coxeter 1963]. When the m(m+1)/2
edges of an m-simplex are all equal, it is called a
regular simplex.

Example 2. A 3-dimensional simplex (tetrahedron)
has 4 vertices, 6 edges, and 4 triangular faces.
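
The element counts quoted for a simplex are binomial coefficients, which the short sketch below tabulates. The function name is hypothetical; the only fact used is that every choice of k+1 of the m+1 vertices spans one k-dimensional element.

```python
from math import comb


def simplex_elements(m):
    """Numbers of k-dimensional elements of an m-simplex, k = 0, ..., m-1."""
    return [comb(m + 1, k + 1) for k in range(m)]


# Tetrahedron (m = 3): vertices, edges, triangular faces.
print(simplex_elements(3))
# A 4-simplex: vertices, edges, triangles, tetrahedral cells.
print(simplex_elements(4))
```

The second entry of each list is the edge count m(m+1)/2 that appears in the definition of a regular simplex.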

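Regarding the random variables $\chi_1, \ldots, \chi_{r-1}$ as
coordinates of a point $\chi$ in (r-1)-dimensional space $E^{r-1}$,
then $\Phi_r(x)$ is the probability that $\chi$ falls in the
(r-1)-simplex Q:

(4)  $Q = \{\chi_1 \le x_1, \ldots, \chi_{r-1} \le x_{r-1},\ -\sum_{i=1}^{r-1} \chi_i \sqrt{p_i(1-p_i)} \big/ \sqrt{p_r(1-p_r)} \le x_r\}$

defined by the system of r inequalities or closed
half-spaces of $E^{r-1}$, where the rth inequality is due to
the singularity condition on the $\chi_i$'s,
$\sum_i \chi_i \sqrt{p_i(1-p_i)} = 0$.

If $G_j$ denotes the vertex of Q obtained by omitting
the jth inequality in (4) and solving the resulting
subsystem of linear equations, then $G_j$ has
coordinates in terms of the $p_i$'s:

(5)  $G_j = (x_1, \ldots, x_{j-1}, x^*_j, x_{j+1}, \ldots, x_{r-1})$,  $j = 1, \ldots, r-1$, where
     $x^*_j = -\sum_{i \ne j} x_i \sqrt{p_i(1-p_i)} \big/ \sqrt{p_j(1-p_j)}$  (sum over $i = 1, \ldots, r$),
     and $G_r = (x_1, \ldots, x_{r-1})$.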

Direct integration over Q yields the following
formula for $\Phi_r(x)$, where without loss of generality,
$\phi_{r-1}$ is the joint marginal normal density function
of $\chi_1, \ldots, \chi_{r-1}$ with covariance matrix K derived from
$\Gamma$ by deleting row r and column r of $\Gamma$.

Theorem  2.  For  any  constants  P|>0,  and  x^  such  that 

23p,'*  S>'iJPr'-Pi’ 

(6)  <^r<x)-P(X-:Q}  |  I  0,  It)  dtr.,...dt 


Zt,|p/'-p,'  Z  >;J^ip/'■PJ> 
4Pk<'-Pk' 


'  k'^^k- 


Proof:  Solve  for  Xr-;  from  the  last  two  inequalities 
in  (4)  to  gel  Lr-;',Xr.;  .Xr-.’l'r-;-  Solve  for  tf- 
f rom  the  (r-J)nd  inequality  in  (4)  and  the  inequality 
just  obtained  to  get  Lr-.  .tr-  .  ^r-  f  r-  >  'XP- 

down  to  I.  0.  V  :  .X  -I  ■  .0 


In the symmetric case, the limits in (6) simplify to

(7)  $L_k = -(r-k)x$ and $U_k = x$,

which agree with a formula given earlier by Bland
and Owen [1966]. The algorithm MVNORM [Milton
1972] is incorporated into a program MNCDF by this
author to evaluate integrals (6), which are then used
with LPNLP to solve (3).

4. Integration over infinite rectangles. Let $A_i$ denote
the ith inequality defining the simplex Q in (4), so
that $\Phi_r(x) = P\{A_1 \cap \ldots \cap A_r\}$ is positive if and only if
$A_1 \cap \ldots \cap A_r \ne \emptyset$ or if $A_1 \cup \ldots \cup A_r = E^{r-1}$.
By the inclusion-exclusion method [Takacs 1967, Feller 1968] the
probability $P_k$ that exactly k events occur among
$A_1, \ldots, A_r$ is

(8)  $P_k = \sum_{j=k}^{r} (-1)^{j-k} \binom{j}{k} B_j$,  $k = 0, 1, 2, \ldots, r$,  where
     $B_j = \sum_{1 \le i_1 < \ldots < i_j \le r} P\{A_{i_1} \cap A_{i_2} \cap \ldots \cap A_{i_j}\}$.

$B_k$ is the kth binomial moment of the number $\eta_r$ of
events occurring among $A_1, \ldots, A_r$; define $B_0 = 1$.

The normal integral $\Phi_r(x)$ is expressed as a linear
combination of its lower-dimensional marginal normal
integrals over infinite rectangles as follows:

Theorem 3. Let $\Phi_k(x_{i_1}, x_{i_2}, \ldots, x_{i_k})$ denote the
k-variate nondegenerate marginal distribution of
$\chi_{i_1}, \chi_{i_2}, \ldots, \chi_{i_k}$, with $\Phi_0(x) = 1$.
If $\Phi_r(x) > 0$, then

(9)  $\Phi_r(x) = \sum_{j=1}^{r} (-1)^{j-1} \sum_{1 \le i_1 < \ldots < i_{r-j} \le r} \Phi_{r-j}(x_{i_1}, \ldots, x_{i_{r-j}})$
     $\qquad = B_{r-1} - B_{r-2} + \ldots + (-1)^{r-1}$.

Proof: $P_0 = 1 - B_1 + B_2 - \ldots + (-1)^r B_r = 1 - P\{A_1 \cup A_2 \cup \ldots \cup A_r\} = 0$.
The conclusion follows on noting that
$\Phi_r(x) = P\{A_1 \cap \ldots \cap A_r\} = B_r$ and
$P\{A_{i_1} \cap \ldots \cap A_{i_k}\} = \Phi_k(x_{i_1}, \ldots, x_{i_k})$. □

In the symmetric case, the $\binom{r}{j}$ terms in the second
summation of (9) are identical to $\Phi_{r-j}(x, \rho)$, yielding

(10)  $\Phi_r(x, \rho) = \sum_{j=1}^{r} (-1)^{j-1} \binom{r}{j} \Phi_{r-j}(x, \rho)$,

another formula by Bland and Owen [1966] for the
equicorrelated normal distribution. In a way, (6) and
(9) extend the earlier formulas (7) and (10),
respectively, to the case when $\Gamma$ is not necessarily
equicorrelated but satisfies (1).

Bohrer and Shervish [1981] added an inviolable
error bound to the algorithm MVNORM when
computing the multivariate normal probabilities of
rectangular regions only, which this author
incorporated into a program IXSk to evaluate
integrals (9) or (10), and used with LPNLP to solve (3).
5. Integration over orthoschemes. A diagonalization
and scaling of the covariance matrix K simplifies the
integrand of $\Phi_r(x)$ in (6), but then the limits of
integration over the image simplex Y of Q turn out
to be complicated. However, by dissecting Y into r!
orthoschemes $O_i$, which are multidimensional
analogues of a right triangle [Coxeter 1963], and then
exploiting the symmetry of the uncorrelated
standardized (i.e., spherical) normal density, each
integral over an orthoscheme $O_i$ has nice limits of
integration. An orthoscheme O is a k-dimensional
simplex such that for some ordering of its vertices,
say, $O_0, O_1, \ldots, O_k$, all the lines
$O_0O_1, O_1O_2, \ldots, O_{k-1}O_k$ are mutually perpendicular. In
fact each triangle $O_iO_jO_k$ ($i < j < k$) is right-angled at
$O_j$. If k = 3, the tetrahedron $O_0O_1O_2O_3$ is known as
quadrirectangular since all of its faces are right-angled.
L. Schlafli [1858] first investigated the
content of a hyperspherical simplex, or simplex
constructed on the surface of a hypersphere,
through a dissection of the given simplex into
spherical orthoschemes. If the vertices of O are
projected radially onto points $P_1, P_2, \ldots, P_k$ on a unit
hypersphere centered at $O_0$, then $P_1P_2 \ldots P_k$ is a
spherical orthoscheme.

Example 3. For r = 3, the integral $\Phi_3(x_1, x_2, x_3)$ equals
that of a bivariate normal over a triangle or 2-dimensional
simplex Q. Upon transforming, $\Phi_3(x)$ is
expressible as a sum of bivariate normal integrals
over each of 3! right triangles or orthoschemes
formed by connecting the origin to each vertex of
the image triangle Y and dropping perpendiculars to
each side. By symmetry, the integral over each of
the six triangles is equal to an integral over a right
triangle with vertices (0,0), $(h_1, 0)$, $(h_1, h_2)$, which are
tabulated as $V(h_1, h_2)$ by the National Bureau of
Standards [1959].

If L is a diagonal matrix with the eigenvalues
$\lambda_1, \ldots, \lambda_{r-1}$ of K on its diagonal, and P is an orthogonal
matrix such that $P'KP = L$, then the variables
$\xi_1, \ldots, \xi_{r-1}$ defined by

(11)  $\xi = L^{-1/2} P' \chi$

are jointly distributed normal with means zero and
covariance equal to I, the (r-1) x (r-1) identity matrix;
moreover, $P(\chi \in Q) = P(\xi \in Y)$. In the symmetric case,
$L = \mathrm{diag}[1/(r-1),\ r/(r-1),\ r/(r-1), \ldots, r/(r-1)]$ [Graybill
1969] and P equals the transpose of Helmert's
original matrix [Lancaster 1965], namely:


'  1  tF’i 

oi:  : 

1  f .T 

I  >f.  : 

■-A- 

(\2)  P- 

i 

.  sFi 

1  •irli.r.i,  , 

i 

,  nT-I 

1 

then the image Y of Q under the transformation (11)
turns out to be a regular simplex T with center at
origin and edge length $e = x\sqrt{2r(r-1)}$, x given as in (2).
The vertex coordinates of T are precisely the
columns of the matrix

(13)  $V = L^{-1/2} P' G$

with the vertex coordinates (5) of Q forming the
columns of G.

The rotation matrix, corresponding to P in (11),
that will diagonalize an arbitrary nonsingular
correlation matrix other than K is unknown in
general. An iterative method, credited to Jacobi
[Carnahan et al. 1969], transforms a real symmetric
matrix into diagonal form by applying a succession of
plane rotations.

The $(r-1)$-dimensional regular simplex $T = T^{r-1}$ may be subdivided into $r!$ congruent orthoschemes $J = J_{r-1}J_{r-2}\cdots J_1J_0$, where $J_{r-1}$ is the center of $T^{r-1}$; $J_{r-2}$ is the center of any one of the $(r-2)$-dimensional bounding cells, to be denoted by $T^{r-2}$, of $T^{r-1}$; $J_{r-3}$ is the center of any one of the $(r-3)$-dimensional bounding cells, to be denoted by $T^{r-3}$, of $T^{r-2}$; ... ; $J_1$ is the center of any one of the 1-dimensional bounding cells, to be denoted by $T^1$, of $T^2$, i.e., $J_1$ is the midpoint of any one of the edges that bound a face of $T$; and finally, $J_0$ is either one of the two endpoints (vertices of $T$) that bound an edge $T^1$ of the regular simplex.

Since each edge $T^1$ is divided into 2 segments, each 2-dimensional cell $T^2$ or triangular face has 3 such edges, each 3-dimensional cell $T^3$ or tetrahedron has 4 such triangular faces, ..., each $(k-1)$-dimensional bounding cell $T^{k-1}$ has $k$ such $(k-2)$-dimensional bounding cells, and finally, the $(r-1)$-dimensional simplex $T^{r-1}$ itself has $r$ such $(r-2)$-dimensional cells, the total number of ways of joining the center $J_{r-1}$ of $T^{r-1}$ to a center $J_{r-2}$ of $T^{r-2}$, to a center $J_{r-3}$ of $T^{r-3}$, ..., to a center or midpoint $J_1$ of an edge $T^1$, to either endpoint $J_0$ of an edge, is precisely $(r)(r-1)\cdots(3)(2) = r!$. All of the $r!$ orthoschemes thus formed are congruent to one another because $T$ is a regular simplex.

An iterated integration formula for evaluating $\Phi_r(x,\rho)$ (symmetric case) using plane orthoschemes shows that the resulting domain of integration is reduced to just $1/r!$ of the regular simplex because of the symmetry of the uncorrelated normal density:

Theorem 4. Let $X_1,\ldots,X_r$ have a joint (singular) normal distribution with $E(X_i) = 0$, $\mathrm{Var}(X_i) = 1$, $E(X_iX_j) = -1/(r-1)$. Let $\xi_1,\ldots,\xi_{r-1}$ have a joint normal distribution with $E(\xi_i) = 0$, $\mathrm{Var}(\xi_i) = 1$, $E(\xi_i\xi_j) = 0$ and density function $\phi(t_1,\ldots,t_{r-1})$. For any $x \ge 0$,

$\Phi_r(x,\rho) = r!\,\Theta_{r-1}(x)$, where

$$\Theta_{r-1}(x) = \int_0^{x}\int_0^{b_1t_1}\int_0^{b_2t_2}\cdots\int_0^{b_{r-2}t_{r-2}} \phi(t_1,\ldots,t_{r-1})\,dt_{r-1}\cdots dt_1 \tag{14}$$

and $b_i = \sqrt{(r-i+1)/(r-i-1)}$ $(i = 1,\ldots,r-2)$.

Proof: Since $\Phi_r(x,\rho) = P\{X_i \le x\ (i = 1,\ldots,r)\} = P\{x \in Q\} = P\{\xi \in T\}$, it suffices to show that:

(a) For some orthoscheme $J^*$ in the simplicial subdivision of $T$, $P\{\xi \in J^*\} = \Theta_{r-1}(x)$.

(b) If $J'$ is any one of the $r!$ orthoschemes of $T$ distinct from $J^*$, then $P\{\xi \in J'\} = P\{\xi \in J^*\}$.

To prove (a), first let $H = H_{r-1}H_{r-2}\cdots H_1H_0$ be any orthoscheme having vertices

$$H_{r-1} = (0,\ldots,0),\quad H_{r-2} = (h_1,0,\ldots,0),\quad H_{r-3} = (h_1,h_2,0,\ldots,0),\quad \ldots,\quad H_1 = (h_1,\ldots,h_{r-2},0),\quad H_0 = (h_1,\ldots,h_{r-2},h_{r-1}), \tag{15}$$

where

$$h_i = x\sqrt{\frac{r(r-1)}{(r-i)(r-i+1)}} \quad (i = 1,\ldots,r-1).$$

Thus

$$P\{\xi \in H\} = P\{0 \le \xi_{r-1} \le (h_{r-1}/h_{r-2})\,\xi_{r-2},\ \ldots,\ 0 \le \xi_2 \le (h_2/h_1)\,\xi_1,\ 0 \le \xi_1 \le h_1\} = \Theta_{r-1}(x)$$

since $h_{i+1}/h_i = b_i$ if $i = 1,\ldots,r-2$, while $h_1 = x$.

Next let $J^* = J_{r-1}J_{r-2}\cdots J_1J_0$ be the orthoscheme of $T$ defined in terms of the vertices of $T$ as follows: $J_{r-1}$ is the center of the simplex formed by $V_1,\ldots,V_r$; $J_{r-2}$ is the center of the simplex formed by $V_1,\ldots,V_{r-1}$; and so on; $J_1$ is the center of the simplex formed by $V_1$ and $V_2$, i.e., $J_1$ is the midpoint of $V_1$ and $V_2$; $J_0$ is the point $V_1$. Let $H$ and $J^*$ be the matrices with columns $H_{r-1},\ldots,H_0$ and $J_{r-1},\ldots,J_0$ respectively. Then

$$P(\xi \in J^*) = P(\xi \in H) = \Theta_{r-1}(x)$$

since there exists an orthogonal transformation matrix $K^*$ such that $H = K^*J^*$, namely,

(16)  $K^* =$

To prove (b), let $J' = J_{r-1}J'_{r-2}\cdots J'_1J'_0$ be any of the $r!$ congruent orthoschemes of $T$, each having $J_{r-1}$ as a common vertex, $J'$ distinct from $J^*$. Let $J'$ be the matrix with columns $J_{r-1},\ldots,J'_0$. There exists an orthogonal transformation matrix, say $\Gamma$, such that $J^* = \Gamma J'$, where $\Gamma$ consists of a rotation, or a reflection, or a composition of rotations and reflections. Hence $P(\xi \in J') = P(\xi \in J^*)$. □
The following integral recurrence formulas follow directly from Theorem 4 above:

$$\Theta_{r-1}(x) = \int_0^x \phi(t)\,\Theta_{r-2}\!\left(t\sqrt{\tfrac{r}{r-2}}\right)dt \tag{17-i}$$

$$\Phi_r\!\left(x, -\tfrac{1}{r-1}\right) = r\int_0^x \phi(t)\,\Phi_{r-1}\!\left(t\sqrt{\tfrac{r}{r-2}},\ -\tfrac{1}{r-2}\right)dt \tag{17-ii}$$

Formula (17-ii) was derived earlier by Ruben [1960], Steck and Owen [1962] and John [1966]. Ruben used a "method of sections" where $T$, a regular simplex centered at $(0,\ldots,0)$, is first divided into $r$ simplices $T_i$ by joining the centroid to the $r$ vertices. The probability content of each $T_i$ is then obtained by passing $(r-1)$-flats parallel to the face opposite that vertex of $T_i$ which coincides with the centroid, and adding up the probability contents over the sections or slabs between parallel flats. Steck and Owen used conditional probabilities and by repeated application of (17-ii) arrived at (14). John indicated the use of probability integrals $V(h_1,\ldots,h_k)$ over special simplices he called "k+1-hedrons" (which are orthoschemes with vertices as in (15)), to evaluate the probability of convex polyhedra in k-dimensional space under normal and t distributions, deriving (17-ii) for his "k+1-hedrons".

If $U_r = \xi_{(r)} - \bar\xi$, the extreme order statistic minus the mean in normal samples of $r$ observations, with $E(\xi_i) = 0$, $\mathrm{Var}(\xi_i) = 1$, then $\Phi_r(x,\rho) = P\{U_r \le x\sqrt{(r-1)/r}\}$. A recurrence formula for $U_r$ [Grubbs 1950, David 1970] used in computing its percentage points may be obtained independently using the integral recurrence formula (17-ii) above.

Example 4. Consider the symmetric case for $r = 6$, $p_1 = \cdots = p_6 = 1/6$ so that $\rho = -1/5$, and let $x_1 = \cdots = x_6 = 1.5$. Numbers in boxes denote CPU time on the VAX/VMS 3.1. From (10) and program IXSK,

$$\Phi_6(1.5,-1/5) = B_1 - B_2 + B_3 - B_4 + B_5 - 1$$

$$= \binom{6}{5}\Phi_5(1.5,-1/5) - \binom{6}{4}\Phi_4(1.5,-1/5) + \binom{6}{3}\Phi_3(1.5,-1/5) - \binom{6}{2}\Phi_2(1.5,-1/5) + \binom{6}{1}\Phi_1(1.5,-1/5) - 1$$

where

$\Phi_5(1.5,-1/5) = .6841 \pm .0001$
$\Phi_4(1.5,-1/5) = .7437013 \pm .0000001$
$\Phi_3(1.5,-1/5) = .8050531 \pm .0000001$
$\Phi_2(1.5,-1/5) = .8682136 \pm .0000001$
$\Phi_1(1.5,-1/5) = .9331926 \pm .0000001$

Hence $\Phi_6(1.5,-1/5) = .626 \pm .001$.

From (14) and program MNCDF,

$\Phi_6(1.5,-1/5) = 6! \cdot (.00086998 \pm .00000001)$

Hence $\Phi_6(1.5,-1/5) = .62638 \pm .00001$.

Earlier attempts to compute $\Phi_6(1.5,-1/5)$ using formulas (6) and (7) with MNCDF have been unsuccessful due either to nonconvergence or to this author's tendency to abort the run whenever it was taking "much too long".

6. Approximations to $\Phi_r(x)$. By the Bonferroni Inequality [Kotz and Johnson 1982],

$P(A_1 \cap \cdots \cap A_r) \ge 1 - \sum_{i=1}^{r} P(A_i^c)$, so

$$\Phi_r(x) \ge \sum_{i=1}^{r}\Phi(x_i) - (r-1) \tag{18}$$

where $A_i^c$ denotes the complement of $A_i$. If $r = 2$, (18) gives the exact value of $\Phi_r(x)$.
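As a small numerical illustration of (18), the sketch below evaluates the first Bonferroni approximation $\sum_i \Phi(x_i) - (r-1)$ using only the univariate normal c.d.f. from the Python standard library. The function name `bonferroni_lower_bound` is hypothetical and is not one of the programs (MNCDF, IXSK, OPRVEC) named in the paper.

```python
from statistics import NormalDist


def bonferroni_lower_bound(x):
    """First Bonferroni approximation (18): sum of Phi(x_i) minus (r - 1)."""
    r = len(x)
    std_normal = NormalDist()  # standard normal, mean 0 and sd 1
    return sum(std_normal.cdf(xi) for xi in x) - (r - 1)


if __name__ == "__main__":
    # Symmetric case of Examples 4 and 5: r = 6, x_i = 1.5.
    print(round(bonferroni_lower_bound([1.5] * 6), 4))
    # r = 3, x_i = 0.70 (first entry of Tables 1 and 2).
    print(round(bonferroni_lower_bound([0.70] * 3), 4))
```

For $r = 6$ and $x_i = 1.5$ this gives approximately .5992, in line with the value of $1 - S_1$ quoted in Example 5.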

Let $S_k$ denote the kth binomial moment of the number of events $m_r$ occurring among $A_1^c,\ldots,A_r^c$. Then

$\Phi_r(x) = P\{A_1 \cap A_2 \cap \cdots \cap A_r\} = P_0$, the probability that exactly none of $A_1^c,\ldots,A_r^c$ occurs. From (8),

$P_0 = \sum_{k=0}^{r}(-1)^kS_k$, where $S_0 = 1$ and $S_r = P\{A_1^c \cap \cdots \cap A_r^c\} = 0$

imply that alternatively, the normal integral equals

$$\Phi_r(x) = 1 - S_1 + S_2 - \cdots + (-1)^{r-1}S_{r-1}, \tag{19}$$

$$S_k = \sum_{i_1 < \cdots < i_k} P\{A_{i_1}^c \cap \cdots \cap A_{i_k}^c\}, \quad k = 1,\ldots,r-1,$$

yielding a finite sequence of approximations,

$$\Phi_r^{(k)}(x) = 1 - S_1 + S_2 - \cdots + (-1)^kS_k, \quad k = 1,\ldots,r-1, \tag{20}$$

with (18) as first approximation when $k = 1$, and having error bounds $S_{k+1}$, the first neglected term:

$$S_2 = \sum_{i<j} P\{A_i^c \cap A_j^c\} = \sum_{i<j} P\{X_i > x_i,\ X_j > x_j\}, \tag{21}$$

or $S_2 = \binom{r}{2}\,P\{X_1 > x,\ X_2 > x;\ \rho = -1/(r-1)\}$

in the symmetric case. The sum of an odd number of terms provides an upper bound and the sum of an even number a lower bound, counting the first term, 1. The bounds increase in sharpness with the number of terms included and the magnitude of the error $e_k$ in the kth approximation does not exceed the first neglected term. The following summarizes results on these bounds:

Theorem 5. (i) Bonferroni inequalities: For $r \ge 2$, and $m = 1,\ldots,r/2$,

$$\sum_{k=0}^{2m-1}(-1)^kS_k \ \le\ \Phi_r(x) \ \le\ \sum_{k=0}^{2m}(-1)^kS_k$$

(ii) Improved Bonferroni inequalities: For $r \ge 2$, $m \ge 0$,

$$\sum_{k=0}^{2m+1}(-1)^kS_k + \frac{2m+2}{r}\,S_{2m+2} \ \le\ \Phi_r(x)$$


(iii) "Best upper bound" using only $S_1$ and $S_2$: Let $[x]$ denote the greatest integer in $x$. Then

$$\Phi_r(x) \ \le\ 1 - \frac{2}{k+1}\,S_1 + \frac{2}{k(k+1)}\,S_2, \qquad k = 1 + \left[\frac{2S_2}{S_1}\right].$$


Proof: (i) See Feller [1968], Kotz & Johnson [1982], David [1970], Galambos [1975]. (ii) See Sobel & Uppuluri [1972], Galambos [1975]. (iii) See Dawson & Sankoff [1967], Kounias & Marin [1976], Galambos [1978].

Table 1 below shows how good an approximation (18) is, considering it involves only the univariate normal distribution, based on computed values of the error bounds (21) using programs MNCDF and IXSK. The values of $x$ were grouped together according as $.5 \cdot 10^{-(t+1)} < S_2(x) \le .5 \cdot 10^{-t}$.

Table 1

Values of $x = x(r,t)$ such that for $x \ge x(r,t)$ the function $\Phi_r^{(1)}(x) = 1 - S_1$ approximates the true value of $\Phi_r(x,\rho)$ to $t$ or more correct decimals, and $3 \le r \le 30$.

t \ r     3     4     5     6     7     8     9
  1     0.70  1.05  1.25  1.40  1.50  1.60  1.65
  2     1.15  1.50  1.70  1.85  1.95  2.05  2.05
  3     1.55  1.90  2.10  2.25  2.35  2.45  2.50
  4     1.85  2.25  2.45  2.60  2.70  2.80  2.85

t \ r    10    11    12    13    14    15    16
  1     1.75  1.80  1.85  1.90  1.90  1.95  2.00
  2     2.20  2.25  2.30  2.30  2.35  2.40  2.40
  3     2.55  2.60  2.65  2.70  2.75  2.75  2.80
  4     2.90  2.95  3.00  3.05  3.05  3.10  3.10

t \ r    17    18    19    20    21    22    23
  1     2.00  2.05  2.05  2.10  2.10  2.15  2.15
  2     2.45  2.45  2.50  2.50  2.55  2.55  2.55
  3     2.80  2.85  2.85  2.90  2.90  2.90  2.95
  4     3.15  3.15  3.20  3.20  3.25  3.25  3.25

t \ r    24    25    26    27    28    29    30
  1     2.20  2.20  2.20  2.25  2.25  2.25  2.30
  2     2.60  2.60  2.60  2.65  2.65  2.65  2.70
  3     2.95  2.95  3.00  3.00  3.00  3.00  3.05
  4     3.25  3.30  3.30  3.30  3.30  3.30  3.35

Table 2 gives the corresponding values of $\Phi_r(x,\rho)$ at $x = x(r,t)$ as given in Table 1. It is often the case that in (3), $\alpha$ is taken equal to 0.30 or less, and Table 2 indicates, for example, that if $1-\alpha \ge .90$, the approximation yields 4 correct decimals if $r = 3$, and only 2 correct decimals if $r = 30$. As the dimension $r$ increases, fewer correct decimals are obtained for the same $\alpha$-value. Note however that for $1-\alpha > .99$, (18) gives 4 correct decimals for $r = 3,\ldots,30$.

Table 2

Values of $\Phi_r^{(1)}(x) = 1 - S_1 \le \Phi_r(x,\rho)$ at $x = x(r,t)$, where $x(r,t)$ is given in Table 1

t \ r     3      4      5      6      7      8      9
  1     .2741  .4126  .4718  .5155  .5324  .5616  .5548
  2     .6248  .7328  .7772  .8071  .8209  .8385  .8392
  3     .8183  .8851  .9107  .9267  .9343  .9429  .9441
  4     .9035  .9511  .9643  .9720  .9757  .9796  .9803

t \ r    10     11     12     13     14     15     16
  1     .5994  .6048  .6141  .6267  .5980  .6162  .6360
  2     .8610  .8655  .8713  .8606  .8686  .8770  .8688
  3     .9461  .9487  .9517  .9549  .9583  .9553  .9591
  4     .9813  .9825  .9838  .9851  .9840  .9855  .9845

t \ r    17     18     19     20     21     22     23
  1     .6132  .6367  .6165  .6427  .6248  .6529  .6371
  2     .8786  .8714  .8820  .8758  .8869  .8815  .8761
  3     .9566  .9607  .9585  .9627  .9608  .9590  .9635
  4     .9861  .9853  .9869  .9863  .9879  .9873  .9867

t \ r    24     25     26     27     28     29     30
  1     .6663  .6524  .6385  .6699  .6577  .6455  .6783
  2     .8881  .8835  .8788  .8913  .8873  .8833  .8960
  3     .9619  .9603  .9649  .9636  .9622  .9609  .9657
  4     .9862  .9879  .9874  .9869  .9865  .9860  .9879

Example 5. Continuing Example 4, formula (19) and IXSK together yield in the symmetric case,

$$\Phi_6(1.5,-1/5) = 1 - S_1 + S_2 - S_3 + S_4 - S_5$$

$$= 1 - \binom{6}{1}\Phi_1^c(1.5,-1/5) + \binom{6}{2}\Phi_2^c(1.5,-1/5) - \binom{6}{3}\Phi_3^c(1.5,-1/5) + \binom{6}{4}\Phi_4^c(1.5,-1/5) - \binom{6}{5}\Phi_5^c(1.5,-1/5),$$

where we define

$\Phi_k^c(1.5,-1/5) = P\{A_1^c \cap \cdots \cap A_k^c\}$, so that (boxed CPU times shown in brackets)

$\Phi_1^c(1.5,-1/5) = .0668069$   [0.01]
$\Phi_2^c(1.5,-1/5) = .0018282$   [0.13]
$\Phi_3^c(1.5,-1/5) = .0000098$   [1.48]
$\Phi_4^c(1.5,-1/5) = .000000 \pm .000001$   [28.57]
$\Phi_5^c(1.5,-1/5) = .000 \pm .001$   [94.17]

Hence $\Phi_6(1.5,-1/5) = .62638 \pm .001$.

From (18), $\Phi_6^{(1)}(1.5,-1/5) = 1 - S_1 = 6\,\Phi(1.5) - 5 = .5991586 \pm .000001$.

The error of this approximation does not exceed $S_2$, and from (21), $S_2 = .027423 \pm .000001$.

Hence $\Phi_6(1.5,-1/5) = .5991586 \pm .027423$.

From Table 1, for $r = 6$, $x = 1.5$, i.e., $x \ge 1.40$, the number $t$ of correct decimals is 1.

Since the integrals in these formulas for $\Phi_r(x)$ must be computed numerically (unless otherwise known), it is possible for an approximation using (20)
to  have  an  actual  error  greater  than  the  first 
neglected  term  because  of  errors  in  the  numerical 
quadrature.  It  is  wise  to  plan  to  compute  enough 
decimal  digits,  in  anticipation  of  the  additivity  of 
the  error  bounds  under  linear  combinations,  so  as  not 
to  render  the  results  meaningless.  On  the  other  hand, 
there  is  no  need  to  compute  numerically  the  integral 
terms  much  more  accurately  than  the  specified  error 
bounds. 


7. Solution of the total optimization problem.

Up till now the emphasis has been on evaluating the singular normal integral $\Phi_r(x)$, with four formulas (6), (9), (14) and (19) ready for computer implementation. Note that while both the $B_k$'s in (9) and the $S_k$'s in (19) are sums of lower-dimensional integrals over infinite rectangles, the $S_k$'s in fact yield "upper tail probabilities" of these marginal normal distributions and therefore decline rapidly (see Examples 4 and 5), yielding approximation (18).

To put the problem in a form suitable for LPNLP, rewrite (3) equivalently as a maximization problem:

$$\text{Maximize } F(x) = -\sum_{i=1}^{r} a_i x_i \sqrt{p_i(1-p_i)} \quad \text{subject to:} \tag{22}$$

(i) $\Phi_r(x) \ge 1-\alpha$ and (ii) $\sum_{i=1}^{r} x_i\sqrt{p_i(1-p_i)} \ge 0$.

With (18), the gradient of the probability function in (22-i) has only univariate normal density functions for its components. Since $\Phi(x_i)$ is a concave function over $x_i \ge 0$, the set $S$ of feasible solutions is a closed convex set. Therefore the absolute maximum of $F$ over $S$, $F$ being linear, is the only local maximum over $S$ [Pierre and Lowe 1975].

Program OPRVEC implements LPNLP on the VAX to solve (22), with calls made to a suitable version of MNCDF whenever $\Phi_r(x)$ is to be evaluated. Copies of OPRVEC are available at cost by writing to J. P. de los Reyes, Department of Mathematical Sciences, Univ. of Akron, Akron, Ohio 44325; tel: (216) 375-7193.

Simulation  is  an  alternative  approach  to  solving 
(22)  or  similar  problems  involving  other  probability 
distributions  such  as  Poisson.  On  the  other  hand,  the 
methods  of  parallel  programming  might  speed  up  the 
numerical  quadrature  portion  of  the  present  solution 
in which, for instance in the nonsymmetric case of (14), the integral $\Theta_{r-1}(x)$ must be evaluated $r!$ times but with different upper limits.
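To make the simulation alternative concrete, the following sketch estimates the symmetric singular normal integral $\Phi_r(x,\rho)$, $\rho = -1/(r-1)$, by plain Monte Carlo. It is an illustration only, not part of OPRVEC: the function name `mc_singular_normal_prob`, the number of draws and the seed are all assumptions. It relies on the order-statistic representation given earlier, in which deviations of i.i.d. standard normals from their sample mean, rescaled by $\sqrt{r/(r-1)}$, have unit variance and pairwise correlation $-1/(r-1)$.

```python
import math
import random


def mc_singular_normal_prob(r, x, n_draws=100_000, seed=1):
    """Monte Carlo estimate of P(X_1 <= x, ..., X_r <= x) in the symmetric
    singular case with correlation -1/(r-1)."""
    rng = random.Random(seed)
    scale = math.sqrt(r / (r - 1))  # gives each X_i unit variance
    hits = 0
    for _ in range(n_draws):
        z = [rng.gauss(0.0, 1.0) for _ in range(r)]
        zbar = sum(z) / r
        # X_i = (z_i - zbar) * scale has corr(X_i, X_j) = -1/(r-1).
        if all((zi - zbar) * scale <= x for zi in z):
            hits += 1
    return hits / n_draws


if __name__ == "__main__":
    print(mc_singular_normal_prob(6, 1.5))
```

With $r = 6$ and $x = 1.5$ the estimate should land near the value .626 obtained in Examples 4 and 5, up to Monte Carlo error of a few thousandths at this number of draws.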

Example  6.  Consider  finding  an  optimal  inventory 
policy  for  a  local  hospital's  blood  bank,  in  the  sense 
that the daily supply $s_i$ of type $i$ blood is a minimum under a preset probability constraint that no shortage occurs that day with probability at least $1-\alpha$. Suppose that the daily demand of human blood of type $i$ $(i = 1,\ldots,r)$ has joint multinomial probability distribution with parameters $n, p_1,\ldots,p_r$, where $p_i$ is the probability that the bank receives an order of one unit of type $i$ blood independently of other requisitions and there are $n$ orders received daily.

First, estimates of the numbers $p_i$, based on actual percentages used (2369 pints total) over 30 randomly selected days within January to October of one year were found to be:

i:       1     2     3     4     5     6     7     8
type:    O+    A+    B+    A-    O-    AB+   B-    AB-
p_i:     [values illegible in the source]

Next, there being no other constraints other than supply levels $s_i$ having a least total, we set $a_i$'s $= 1$ in function $G$ and consequently, $a_i$'s $= 1$ in $F$ also. Now solve the normal case (22) for $x$, from the given values of $a_i$, $p_i$, $r$, and desired values of $\alpha$. Use OPRVEC, with approximation (18) for quick results, to obtain Table 3. Equicoordinate vectors $(x,\ldots,x)$ for the symmetric case are in the last row of Table 3.

Finally, the corresponding optimal multinomial vectors $(s_1,\ldots,s_8)$ representing the minimum daily blood supply levels, for the desired risk levels, assuming $n = 1000$, are summarized in Table 4. The first column of Table 4 gives the expected demands based on the estimates of $p_i$ above. Recall the assumption that no resupply can occur in the same day, hence note the overstocking that increases as the probability $1-\alpha$ of no shortage increases. If we assume that $p_1 = p_2 = \cdots = p_8 = 1/8$, then the average demand is 125 at $n = 1000$. The optimal supply levels $s$ and totals are shown in the last two rows of Table 4.

Table 3

Normal Probability Vectors for Blood Bank (see Example 6)

x \ 1-α     .90      .95      .99      .995     .999
x_1       1.9686   2.2538   2.8217   3.0391   3.4853
x_2       1.9953   [illegible]   2.8404   3.0565
Table 4

Multinomial Probability Vectors for Blood Bank (see Example 6)
For comparison with other values already published, OPRVEC, when run with (18), obtained upper $\alpha = .10$ and $.005$ probability vectors in the symmetric case, which agreed to 2 and 3 decimal places respectively with those obtained through percentage points of the order statistic $U_r$ mentioned after (17) above. When $a_i$'s $= 1$ in the symmetric case, there is really no need to use OPRVEC on (22) since it is plausible that the required vector $x$ must be equicoordinate, thus $x_i = \Phi^{-1}(1-\alpha/r)$, the upper $\alpha/r$ probability point of the standard normal distribution. OPRVEC of course produces the same equicoordinate vectors $x$ in this case.
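The equicoordinate shortcut is easy to check numerically. The sketch below computes the upper $\alpha/r$ point of the standard normal distribution and, as an assumption not stated in this section, converts it to a supply level through the usual normal standardization of a multinomial count, $s_i = np_i + x_i\sqrt{np_i(1-p_i)}$. Both function names are hypothetical.

```python
import math
from statistics import NormalDist


def equicoordinate_point(r, alpha):
    """Upper alpha/r point of N(0, 1): solves r * Phi(x) - (r - 1) = 1 - alpha."""
    return NormalDist().inv_cdf(1.0 - alpha / r)


def supply_level(n, p, x):
    """Assumed standardization: s = n p + x * sqrt(n p (1 - p))."""
    return n * p + x * math.sqrt(n * p * (1.0 - p))


if __name__ == "__main__":
    # Symmetric blood bank case: r = 8 types, p_i = 1/8, n = 1000 orders.
    x = equicoordinate_point(8, 0.10)
    print(round(x, 4), round(supply_level(1000, 1.0 / 8.0, x), 1))
```

Under these assumptions, $\alpha = .10$ with $r = 8$ gives $x \approx 2.24$ and a supply level of about 148 units per blood type against an average demand of 125.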

Practical  applications  of  probability  constrained 
or  "chance-constrained"  programming  in  general 
include the models of minimum cattle feed [Bracken and McCormick 1968] and hog feed rations [Pierre and Lowe 1975] under probabilistic protein constraints, an optimal cost nutrition program under probabilistic nutrient level constraints [Prekopa 1970], and an optimal spare parts kit for a multicomponent system in which demand for spares is generated by component failures having an exponential distribution [Proschan 1960].
APPROACHES FOR EMPIRICAL BAYES CONFIDENCE INTERVALS FOR A VECTOR OF EXPONENTIAL SCALE PARAMETERS


Bradley  P.  Carlin  and  Alan  E.  Gelfand,  University  of  Connecticut 


ABSTRACT 

Parametric  empirical  Bayes  methods  of  point 
estimation  date  to  the  landmark  paper  of  James 
and  Stein  (1961).  Interval  estimation  through 
parametric  empirical  Bayes  techniques  has  a 
somewhat  shorter  history,  which  is  summarized  in 
the  recent  paper  of  Laird  and  Louis  (1987).  In 
the  i.i.d.  exchangeable  case,  one  obtains  a 
"naive"  EB  confidence  interval  by  simply  taking 
appropriate percentiles of the estimated posterior
distribution of the parameter, where the
estimation of the prior parameters ("hyperparameters")
is accomplished through the marginal
distribution  of  the  data.  Unfortunately,  these 
"naive"  intervals  tend  to  be  too  short,  since 
they  fail  to  account  for  the  variability  in  the 
estimation  of  the  hyperparameters.  That  is, 
they do not attain the desired coverage probability,
both in the classical sense and in the
"EB"  sense  defined  in  Morris  (1983a). 

In  this  paper  we  consider  two  methods  for 
developing  EB  intervals  for  exponential  scale 
parameters which attempt to correct this deficiency
in the naive intervals. The first is a
"bias  corrected  naive"  method  inspired  by  Efron 
(1987).  Simply  put,  this  method  adjusts  the 
naive  intervals  using  tail  areas  determined  by 
the  parametric  structure  of  the  model  and  the 
data. The second method uses a parametric bootstrap
(Laird and Louis, 1987) to match a
specified  hyperprior  Bayes  solution.  Finally, 
through  simulation  we  compare  methods  with 
respect  to  EB  coverage  and  length. 


1.  INTRODUCTION 

Consider the i.i.d. exchangeable Bayesian formulation where $Y_i \mid \theta_i \sim \mathrm{Gamma}(\nu_i, \theta_i)$, $i = 1,\ldots,p$ independent, $\nu_i$ known, and the $\theta_i$'s have the conjugate inverse gamma (IG) prior, $\theta_1,\ldots,\theta_p \overset{iid}{\sim} IG(a,b)$, $a, b > 0$. We take $\nu_i = 1$ for convenience, though the case of general known $\nu_i$ is discussed briefly in Section 2. Thus $f(y_i|\theta_i) = \theta_i^{-1}\exp(-y_i/\theta_i)$, $y_i > 0$, $\theta_i > 0$, and $g(\theta_i|a,b) = \exp(-1/(\theta_i b))/(\Gamma(a)\,b^a\,\theta_i^{a+1})$, $a, b > 0$, $i = 1,\ldots,p$. Hence the $Y_i$'s are marginally i.i.d. with distribution

$$f(y_i|a,b) = \frac{ab}{(by_i+1)^{a+1}}, \quad y_i > 0 \tag{1.1}$$

and the posterior distribution of $\theta_i$ is $IG(a+1, (y_i+1/b)^{-1})$, i.e.,

$$f(\theta_i|y_i,a,b) = \frac{\exp(-(y_i+1/b)/\theta_i)}{\Gamma(a+1)\,(y_i+1/b)^{-(a+1)}\,\theta_i^{a+2}}. \tag{1.2}$$

Taking the scale parameter $b = 1$, we view $a$ as unknown, and estimate it from the marginal distribution of $Y$, $f(y|a) = \prod_{i=1}^{p} f(y_i|a)$. For EB point estimation a best choice of $\hat a$ (e.g., MLE, UMVUE, moments estimator) is not clear. Not surprisingly, this same difficulty arises in developing EB confidence intervals. Usual estimators of $a$ take the form $\hat a_c = c/\sum_{i=1}^{p}\log(Y_i+1)$. For instance, $c = p$ yields the MLE while $c = p-1$ yields the UMVUE. Choosing one of these as our estimate of $a$, the "naive" EB confidence interval for $\theta_i$ is simply the upper and lower $\alpha/2$-points of the "estimated posterior," i.e., (1.2) with $a$ replaced by $\hat a$. These intervals are called "naive" because in ignoring randomness in $\hat a$ they tend to be too short. More precisely, Morris (1983a,b) defines an EB confidence set of size $1-\alpha$ as a subset $t_\alpha(Y)$ of $\Theta$ such that

$$P(\theta \in t_\alpha(Y)) \ge 1-\alpha \tag{1.3}$$

where the probability is calculated over the joint distribution of $\theta$ and $Y$. The naive intervals generally fail to satisfy (1.3).
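The naive construction can be sketched in a few lines. The code below is an illustration under stated assumptions, not the authors' program: it takes $b = 1$ and $\nu_i = 1$, uses the estimator $\hat a_c = c/\sum\log(y_i+1)$, and approximates the quantiles of the estimated posterior $IG(\hat a+1, (y_i+1)^{-1})$ by sorting Monte Carlo draws rather than inverting a $\chi^2$ c.d.f. The function names, the number of draws and the seed are hypothetical.

```python
import math
import random


def estimate_a(y, c=None):
    """a_hat_c = c / sum(log(y_i + 1)); c = p is the MLE, c = p - 1 the UMVUE."""
    p = len(y)
    if c is None:
        c = p
    return c / sum(math.log(yi + 1.0) for yi in y)


def naive_eb_interval(y_i, a_hat, alpha=0.10, n_draws=20_000, seed=0):
    """Approximate alpha/2 and 1 - alpha/2 points of the estimated posterior.

    If theta | y ~ IG(a_hat + 1, 1/(y + 1)), then (y + 1)/theta is
    Gamma(a_hat + 1, 1), so theta can be drawn as (y + 1) / G.
    """
    rng = random.Random(seed)
    draws = sorted((y_i + 1.0) / rng.gammavariate(a_hat + 1.0, 1.0)
                   for _ in range(n_draws))
    lower = draws[int(alpha / 2.0 * n_draws)]
    upper = draws[int((1.0 - alpha / 2.0) * n_draws) - 1]
    return lower, upper


if __name__ == "__main__":
    y = [0.5, 1.2, 0.3, 2.0, 0.8]  # made-up data for illustration
    a_hat = estimate_a(y)
    print(a_hat, naive_eb_interval(y[0], a_hat))
```

Because the interval treats $\hat a$ as if it were the true $a$, it carries no allowance for estimation error, which is the source of the undercoverage discussed in the text.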

In  this  paper,  we  propose  two  methods  to 
correct  this  deficiency  in  the  naive  intervals. 

In  Section  2  we  introduce  a  method  for  bias- 
correcting  the  naive  interval,  and  discuss  some 
of  its  properties.  Section  3  develops  a  method 
which  matches  any  hyperprior  Bayes  solution.  A 
parametric  bootstrap  (Laird  and  Louis,  1987)  is 
used  in  place  of  numerical  integration.  Section 
4  obtains  simulated  coverage  probabilities  and 
interval  lengths  for  the  methods. 

2.  THE  BIAS  CORRECTED  NAIVE  APPROACH 

Efron (1987) proposed a general framework for correcting the bias in naive EB intervals. In our setting, a simpler bias correction may be developed. From (1.2) we see that the $\alpha$th quantile of the posterior distribution $f(\theta_i|y_i,a)$ is

$$q_\alpha(a) = 2(y_i+1)/D^{-1}_{2(a+1)}(1-\alpha) \tag{2.1}$$

where $D_k^{-1}$ is the inverse c.d.f. of a $\chi^2$ distribution with $k$ (not necessarily integer) degrees of freedom. Define

$$\begin{aligned}
\rho(\hat a,a,\alpha) &= P\{\theta_i \le q_\alpha(\hat a) \mid \theta_i \sim f(\theta_i|y_i,a)\} \\
&= P\{\theta_i \le 2(y_i+1)/D^{-1}_{2(\hat a+1)}(1-\alpha) \mid \theta_i \sim IG(a+1,(y_i+1)^{-1})\} \\
&= P\{\psi_i \le 2/D^{-1}_{2(\hat a+1)}(1-\alpha) \mid \psi_i \sim IG(a+1,1)\}
\end{aligned} \tag{2.2}$$

where $\psi_i = \theta_i/(y_i+1)$. Thus $\rho(\hat a,a,\alpha) = 1 - D_{2(a+1)}(D^{-1}_{2(\hat a+1)}(1-\alpha))$. Let

$$\pi(a,\alpha) = E_{\hat a|a}[\rho(\hat a,a,\alpha)]. \tag{2.3}$$

For the UMVUE $\hat a$, $\hat a|a \sim IG(p, (a(p-1))^{-1})$.

Note that (2.3) is a naive EB tail area in the Morris sense, (1.3). Typically, $\pi(a,1-\alpha/2) - \pi(a,\alpha/2) < 1-\alpha$, that is, the naive intervals are too short. It is usually argued that this undercoverage arises because we are failing to take into account the variability in $\hat a$. In any event, suppose we solve

$$\pi(a,\alpha') = \alpha \tag{2.4}$$

for $\alpha'$. This $\alpha'$ would "correct the bias" in using $\hat a$ in our naive procedure. Applying (2.4) would produce intervals with exactly the desired coverage probability. But, of course, we cannot solve (2.4) since $a$ is unknown. Instead, we propose to solve

$$\pi(\hat a,\alpha') = \alpha \tag{2.5}$$

to obtain $\alpha' = \alpha'(\hat a,\alpha)$. Then we take as our bias corrected naive EB confidence interval the naive interval with "$\alpha$" replaced by "$\alpha'$". Under mild regularity conditions, this procedure gives a unique confidence interval. We note that correcting $\alpha$ to $\alpha'(\hat a,\alpha)$ is equivalent to correcting $q_\alpha(\hat a)$ to $q_{\alpha'}(\hat a)$, which in turn is equivalent to correcting the quantiles of the estimated posterior. Also note that since we were able to "scale out" $y_i$ in (2.2), the integration in (1.3) could be done by integrating over $\hat a|a$. Equation (2.5) can be solved using a one-dimensional numerical integration (we transformed the IG to the interval (0,1) and used 16-point Gaussian integration; see Abramowitz and Stegun, 1967) with one rootfinder (using false position).

We can extend our work to the full Gamma/IG problem, i.e., $Y_i \overset{ind}{\sim} \mathrm{Gamma}(\nu_i,\theta_i)$ where $\nu_i$ known and $\theta_i \overset{iid}{\sim} IG(a,b)$, $i = 1,\ldots,p$. Again we take $b = 1$. (Note that this case includes the $\chi^2$ scale problem.) One can show that

$$Y_i|a \sim \frac{\Gamma(\nu_i+a)}{\Gamma(\nu_i)\Gamma(a)}\cdot\frac{y_i^{\nu_i-1}}{(y_i+1)^{\nu_i+a}},$$

a Pearson Type VI distribution (Johnson and Kotz, 1970). Again we can scale out $y_i$ as in (2.2), and now $\psi_i|a \sim IG(\nu_i+a,1)$. Note that now the MLE $\hat a$ is no longer available in closed form. However, since $T(\hat a) = \sum_{i=1}^{p}\log(y_i+1)$ is decreasing in $\hat a$, we can use the distribution of $T(\hat a)$ to implement the bias correction.

Before concluding this section, we address the question of whether the bias correcting method actually produces EB confidence intervals in the Morris sense. Since from (2.5) $\alpha'$ is random, we need to look at the tail area

$$E_{\hat a|a}\,P\{\psi_i \le 2/D^{-1}_{2(\hat a+1)}(1-\alpha'(\hat a,\alpha)) \mid \psi_i \sim IG(a+1,1)\} = E_{\hat a|a}\,\rho(\hat a,a,\alpha'(\hat a,\alpha)). \tag{2.6}$$

While exact evaluation of this expectation is not possible, Carlin and Gelfand (1988) show that (2.6) falls in an interval containing $\alpha$. In fact, (2.6) is bounded above by $\alpha + \max(I_1,I_2)$ and below by $\alpha + \min(I_1,I_2)$, where

$$I_1 = \int_{\hat a > a}\,[\alpha'(\hat a,\alpha) - \rho(\hat a,a,\alpha'(\hat a,\alpha))]\,dF(\hat a|a),$$

and

$$I_2 = \int_{\hat a < a}\,[\alpha'(\hat a,\alpha) - \rho(\hat a,a,\alpha'(\hat a,\alpha))]\,dF(\hat a|a).$$

Moreover, the simulations in Section 4 indicate that this method does achieve EB coverage, (1.3).

3.  THE  PARAMETRIC  BOOTSTRAP  APPROACH 

Several authors (Deely and Lindley, 1981; Rubin, 1982; Morris, 1983a,b, 1987; Laird and Louis, 1987) in the PEB setting have attempted to account for the variation in estimating a hyperparameter by introducing a hyperprior distribution. Quantiles of the resulting "marginal posterior" are used in place of those of the estimated posterior. We note that while this approach is not directly aimed at developing intervals with the desired EB coverage, it is generally applicable and has worked well in our empirical studies. In our exponential/inverse gamma setting let us place a hyperprior $\tau(a)$ on $a$. This induces $h(a|y)$, which in turn induces

$$m(\theta|y) = \int \prod_{i=1}^{p} f(\theta_i|y_i,a)\,dH(a|y), \tag{3.1}$$

the marginal posterior. Using the MLE for $a$, which is sufficient for (1.1), and the flat hyperprior $\tau_1(a) = 1_{(0,\infty)}(a)$, the marginal posterior for $\theta_i$ simplifies to

$$m(\theta_i|y_i,\hat a) = \int f(\theta_i|y_i,a)\,dH(a|\hat a) = \int f(\theta_i|y_i,a)\cdot\mathrm{Gamma}(p+1,\hat a/p)\,da. \tag{3.2}$$

In this setting the Type III parametric bootstrap of Laird and Louis may be used to approximate (3.2). It calls for drawing $\theta_i^*$ i.i.d. from $G(\theta|\hat a)$, and then $Y_i^*$ independently from $f(y|\theta_i^*)$, $i = 1,\ldots,p$. We then compute $a^*$ from the $Y_i^*$'s in the same way that $\hat a$ was obtained from the $Y_i$'s. Thus $a^*|\hat a$ is distributed as $F(a^*|\hat a) = IG(p, 1/(\hat a p))$. Obtaining $N$ bootstrapped $a_j^*$'s in this fashion, we use the mixture distribution

$$\sum_{j=1}^{N} f(\theta_i|y_i,a_j^*)/N = \sum_{j=1}^{N} IG(a_j^*+1,(y_i+1)^{-1})/N \tag{3.3}$$

to approximate (3.2). The EB confidence interval for $\theta_i$ is then computed by finding the $\alpha/2$ and $1-\alpha/2$ points of (3.3).
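The bootstrap recipe leading to (3.3) can be sketched as follows. This is a minimal illustration assuming $b = 1$, $\nu_i = 1$ and the MLE for $a$; the function name `laird_louis_interval`, the default of 50 posterior draws per bootstrap replicate and the seed are assumptions made here, and the quantiles of the mixture are approximated by sorting draws rather than computed exactly.

```python
import math
import random


def laird_louis_interval(y, i, alpha=0.10, n_boot=400, draws_per_boot=50, seed=0):
    """Approximate the alpha/2 and 1 - alpha/2 points of the mixture (3.3)."""
    rng = random.Random(seed)
    p = len(y)
    a_hat = p / sum(math.log(yj + 1.0) for yj in y)  # MLE of a
    mixture = []
    for _ in range(n_boot):
        # theta* ~ IG(a_hat, 1): reciprocal of a Gamma(a_hat, 1) draw.
        theta_star = [1.0 / rng.gammavariate(a_hat, 1.0) for _ in range(p)]
        # Y* | theta* is exponential with mean theta*.
        y_star = [rng.expovariate(1.0 / t) for t in theta_star]
        a_star = p / sum(math.log(ys + 1.0) for ys in y_star)
        # Equal numbers of draws per replicate give an equally weighted
        # mixture of IG(a_star + 1, 1/(y_i + 1)) components.
        mixture.extend((y[i] + 1.0) / rng.gammavariate(a_star + 1.0, 1.0)
                       for _ in range(draws_per_boot))
    mixture.sort()
    m = len(mixture)
    return mixture[int(alpha / 2.0 * m)], mixture[int((1.0 - alpha / 2.0) * m) - 1]


if __name__ == "__main__":
    y = [0.5, 1.2, 0.3, 2.0, 0.8]  # made-up data for illustration
    print(laird_louis_interval(y, 0))
```

Drawing $\theta^*$ and $Y^*$ explicitly mirrors the description in the text; since $a^*|\hat a \sim IG(p, 1/(\hat a p))$ in this model, one could equally draw $a^*$ directly from that distribution.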

Note, however, that the expected value of (3.3) is

$$\int f(\theta_i|y_i,a^*)\,dF(a^*|\hat a). \tag{3.4}$$

As Hill (1987) notes, $F(a^*|\hat a)$ is not the same as $H(a|\hat a)$, and thus (3.4) may be a poor approximation for (3.2). Again the empirical success of (3.3) suggests that this may not be an important issue, especially since the link between (3.2) and desired EB coverage is tenuous. Nonetheless, how can we achieve a better approximation to (3.2) for our problem? Consider the integral

$$\int f(\theta_i|y_i,\hat a^2/a^*)\cdot(\hat a/a^*)\cdot f(a^*|\hat a)\,da^*. \tag{3.5}$$

Next let $b^* = \hat a^2/a^*$, so that $b^*|\hat a \sim \mathrm{Gamma}(p,\hat a/p)$. (Note that $b^*$ is conditionally unbiased for $\hat a$.) Using this transformation, a little algebra shows that (3.5) is equal to

$$\int f(\theta_i|y_i,b^*)\cdot\mathrm{Gamma}(p+1,\hat a/p)\,db^* \tag{3.6}$$

which is identical to (3.2). Thus, instead of (3.3), we would use

$$\sum_{j=1}^{N} IG(\hat a^2/a_j^*+1,(y_i+1)^{-1})\cdot(\hat a/a_j^*)/N \tag{3.7}$$

and take the upper and lower $\alpha/2$-points of this distribution as our confidence interval for $\theta_i$.
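The "little algebra" linking (3.5) and (3.6) can be written out as follows. This is a worked sketch that assumes the scale parametrization of the Gamma distribution, under which $\mathrm{Gamma}(p, \hat a/p)$ has mean $\hat a$; the symbol $g_p$ for that density is introduced here for convenience.

```latex
% Density of b^* = \hat a^2 / a^* given \hat a, i.e. Gamma(p, \hat a/p) in scale form:
g_p(b^*) = \frac{(b^*)^{p-1} e^{-p b^*/\hat a}}{\Gamma(p)\,(\hat a/p)^{p}}, \qquad b^* > 0 .

% The weight in (3.5) becomes \hat a / a^* = b^*/\hat a, so
\int f(\theta_i \mid y_i, \hat a^2/a^*)\,\frac{\hat a}{a^*}\, f(a^* \mid \hat a)\, da^*
  = \int f(\theta_i \mid y_i, b^*)\,\frac{b^*}{\hat a}\, g_p(b^*)\, db^* .

% Absorbing the weight raises the shape parameter by one:
\frac{b^*}{\hat a}\, g_p(b^*)
  = \frac{(b^*)^{p} e^{-p b^*/\hat a}}{\hat a\,\Gamma(p)\,(\hat a/p)^{p}}
  = \frac{(b^*)^{p} e^{-p b^*/\hat a}}{\Gamma(p+1)\,(\hat a/p)^{p+1}},
\quad\text{since } \hat a\,\Gamma(p) = p\,\Gamma(p)\cdot\frac{\hat a}{p} = \Gamma(p+1)\,\frac{\hat a}{p}.
```

The last expression is the $\mathrm{Gamma}(p+1, \hat a/p)$ density, which is the weight appearing in (3.6) and (3.2).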

Ragunathan (1987) suggests a modification of (3.7) to a truly weighted average

$$\int f(\theta_i|y_i,a)\cdot\mathrm{Gamma}(p,\hat a/p)\,da. \tag{3.9}$$

But since we know $b^*|\hat a \sim \mathrm{Gamma}(p,\hat a/p)$, the $\hat a/a^*$ term in (3.5) is no longer needed. The approximation to (3.9) becomes

$$\sum_{j=1}^{N} IG(\hat a^2/a_j^*+1,(y_i+1)^{-1})/N. \tag{3.10}$$

As we shall see, this approach proves better than (3.8). This is perhaps because $\tau_2$ is a more appropriate hyperprior for a shape parameter; we are matching a more reasonable hyperprior Bayes solution.

As a final simplification, note that if we replace $\hat a^2/a_j^*$ by its expectation, (3.10) becomes

$$IG(\hat a+1,(y_i+1)^{-1}) \tag{3.11}$$

which is the estimated posterior. Thus not only are (3.7), (3.8), and (3.10) approximations to (3.2), but we may also view them as ways to incorporate bootstrap variation into the naive EB intervals in an effort to "lengthen" them.

Again,  while  exact  evaluation  of  the  coverage 
probabilities  of  these  intervals  is  not 
possible,  we  shall  see  in  the  simulation  results 
of  the  next  section  that  the  bootstrap  intervals 
do  generally  achieve  the  desired  EB  coverage. 
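
The mixture in (3.10) can be sketched numerically. The following is a minimal illustration, not the authors' code: the function name, its arguments, the root-finding bracket and the example data are all assumptions. It uses the marginal MLE for â so that b* | â ~ Gamma(p, â/p) holds as stated in the text, draws the a*_j by a parametric bootstrap, and takes equal-tail points of the resulting mixture of inverse gamma distributions.

```python
import numpy as np
from scipy import optimize, stats


def tau2_matching_interval(y, i, gamma=0.90, n_boot=400, seed=0):
    """Equal-tail interval for theta_i from the bootstrap mixture (3.10).

    Hypothetical helper: name, arguments and defaults are illustrative.
    """
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float)
    p = y.size
    a_hat = p / np.log1p(y).sum()  # marginal MLE of the hyperparameter a

    # Parametric bootstrap: theta* ~ IG(a_hat, 1), y* ~ Exponential(theta*),
    # then re-estimate a on each bootstrap sample.
    theta_star = 1.0 / rng.gamma(shape=a_hat, scale=1.0, size=(n_boot, p))
    y_star = rng.exponential(scale=theta_star)
    a_star = p / np.log1p(y_star).sum(axis=1)

    # Mixture components IG(a_hat^2 / a*_j + 1, (y_i + 1)^-1); in SciPy's
    # parametrization this is invgamma(shape, scale=y_i + 1).
    shapes = a_hat**2 / a_star + 1.0
    scale = y[i] + 1.0

    def mixture_cdf(t):
        return stats.invgamma.cdf(t, shapes, scale=scale).mean()

    def quantile(q):
        # Root-find on the mixture CDF; the bracket is an assumed wide range.
        return optimize.brentq(lambda t: mixture_cdf(t) - q, 1e-8, 1e8)

    alpha = 1.0 - gamma
    return quantile(alpha / 2), quantile(1.0 - alpha / 2)


y_obs = [0.5, 1.2, 0.3, 2.0, 0.8]  # made-up data, p = 5
lower, upper = tau2_matching_interval(y_obs, i=0)
print(lower, upper)
```

Multiplying each mixture component by the weight â/a*_j before averaging would give the corresponding sketch for (3.7).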

4.  SIMULATED  COVERAGE  PROBABILITIES 
AND  INTERVAL  LENGTHS 

In  this  section  we  present  and  discuss  the 
results  of  a  simulation  study  which  compares  the 
methods  for  our  Exponential/IG  problem.  Since 
we  are  working  in  the  EB  framework,  coverage  was 


evaluated in this context. That is, for fixed
a and p, we generated θ_i's i.i.d. as IG(a,1), and
then generated the Y_i's independently as
Exponential(θ_i), i = 1,...,p. Each simulation is

based  on  3000  replications;  for  the  methods 
requiring  a  bootstrap,  we  used  N  =  400  bootstrap 
trials  per  replication. 
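
The simulation design just described can be sketched for the simplest of the methods, the naive EB interval. This is a minimal illustration under stated assumptions: the function name and defaults are hypothetical, the interval is taken to be the equal-tail interval of the estimated posterior IG(â+1, (y_i+1)^{-1}), and â is the marginal UMVUE (p-1)/Σ log(y_i+1).

```python
import numpy as np
from scipy import stats


def naive_eb_coverage(a=2.0, p=5, gamma=0.90, n_rep=3000, seed=0):
    """Monte Carlo individual EB coverage of the naive interval (sketch)."""
    rng = np.random.default_rng(seed)
    alpha = 1.0 - gamma

    # theta_i ~ IG(a, 1): reciprocal of a Gamma(shape=a, scale=1) draw.
    theta = 1.0 / rng.gamma(shape=a, scale=1.0, size=(n_rep, p))
    # Y_i | theta_i ~ Exponential with scale (mean) theta_i.
    y = rng.exponential(scale=theta)

    # Marginal UMVUE of a, one estimate per replication.
    a_hat = (p - 1) / np.log1p(y).sum(axis=1, keepdims=True)

    # Estimated posterior IG(a_hat + 1, (y_i + 1)^-1) and its equal-tail points.
    post = stats.invgamma(a_hat + 1.0, scale=y + 1.0)
    lower = post.ppf(alpha / 2)
    upper = post.ppf(1.0 - alpha / 2)

    covered = (lower <= theta) & (theta <= upper)
    return covered.mean()  # averaged over both i and the replications


coverage = naive_eb_coverage()
print(f"naive EB individual coverage: {coverage:.3f}")
```

Under these assumptions the estimate should fall below the nominal level, in line with the undercoverage of the naive intervals discussed in this section.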

Tables  4.1  -  4.4  show  lower  endpoint,  upper 
endpoint,  interval  length  (all  averaged  over  both 
i  and  the  replications)  and  individual  and  simul¬ 
taneous  EB  coverage  probability  for  the  classical, 
naive  EB,  bias  corrected  naive  EB,  Laird  and 
Louis bootstrap, and hyperprior matching bootstrap
methods (3.8) and (3.10) (corresponding to
hyperpriors τ1 and τ2, respectively), for p = 5,
10, true a = 2, 5, and nominal individual coverage
probabilities γ = .90 and .95. Recall that
in  the  bias  corrected  method,  the  choice  of  this 

estimator â affects three parts of the procedure:
the  computation  of  the  tt  function  (2.3)  (we  need 

the distribution of â|a), the actual solution of
(2.5), and in the estimated posterior distribution.
(The last of these three is the only
place â shows up in the naive procedure.) In our
simulation, for the naive and bias corrected
naive we show results obtained using the marginal
UMVUE â = (p-1)/Σ_i log(y_i+1). Results (not shown)
obtained using the marginal MLE â = p/Σ_i log(y_i+1)
gave longer (i.e., too conservative) bias corrected
intervals (extending further to the right),
but shorter naive intervals. For the three
bootstrap methods, we also used the UMVUE for â,
but this time because the Laird and Louis MLE
intervals generally failed to attain the nominal
coverage probability. The two hyperprior
matched intervals were insensitive to the choice
of â. This was what we expected, since we were
matching the same hyperprior Bayes solution,
regardless of the choice of â.

In terms of comparing the methods, several
observations can be made. As expected, the
classical intervals faithfully achieve the desired
coverages, but are unacceptably long--in one
case (p = 10, a = 5, γ = .95) more than ten times
longer than the better EB intervals. As has been
noted by previous authors (Morris, 1983a,b;
Laird and Louis, 1987), the naive EB intervals
perform surprisingly well, especially for small
a and large p. Yet in no case do they achieve
the desired coverage; for small p and large a
they are especially poor. The bias corrected
naive intervals, on the other hand, are slightly
conservative, though not significantly so (the
average coverage probability numbers have a
standard error of about .5%). In addition, the
bias corrected intervals have lengths that are
quite competitive with those for the two bootstrap
methods shown, especially when γ = .90.

The bootstrap methods also produce intervals that
generally hit the desired coverage. Notice that
the intervals based on matching the flat hyperprior
τ1 generally fail to achieve the desired
coverage probability. By using the hyperprior


    \sum_{j=1}^{N} IG(\hat{a}^2/a_j^* + 1, (y_i+1)^{-1}) \cdot (\hat{a}/a_j^*) / \sum_{j=1}^{N} (\hat{a}/a_j^*).   (3.8)

Since E(â/a* | â) = 1, this modification seems
reasonable and works better computationally.
If instead of our flat hyperprior, we use
τ2(a) = (1/a) · I_(0,∞)(a), then (3.2) becomes

TABLE 4.1:  a = 2, p = 5

                    Average    Average    Average    Average      Simul-
Interval            Lower      Upper      Interval   Individual   taneous
Method              Endpoint   Endpoint   Length     Cov. Prob.   Cov. Prob.

γ = .90
Classical            .355      19.5       19.2        90.1         59.1
Naive EB             .355       3.51 (?)   --         83.9         46.8
Bias Corrected       .331       4.41 (?)   --         89.7         60.8
Laird and Louis      .339       4.81 (?)   --         90.4         63.2
τ1 Matching          .287       --         2.95       86.8         55.6
τ2 Matching          .311       4.00       3.69       89.4         61.0

γ = .95
Classical            .268      38.8 (?)    --         95.2         78.3
Naive EB             .306       5.22 (?)   --         90.0         64.4
Bias Corrected       .285       7.55 (?)   --         95.2         79.0
Laird and Louis      .283       7.79       7.50       95.4         80.9
τ1 Matching          .246       4.46       4.51       --           74.7
τ2 Matching          .265       5.93       5.66       --           79.8

--   entry illegible.
(?)  only one of the upper endpoint and interval length is legible;
     the column to which the value belongs is uncertain.

TABLE 4.2:  a = 5, p = 5

                    Average    Average    Average    Average      Simul-
Interval            Lower      Upper      Interval   Individual   taneous
Method              Endpoint   Endpoint   Length     Cov. Prob.   Cov. Prob.

γ = .90
Classical            .084       4.89       4.81       89.9         59.0
Naive EB             .134       .690       .556       77.1         35.4
Bias Corrected       .116       1.03       .914       90.2         66.1
Laird and Louis      .114       1.04       .928       89.9         66.9
τ1 Matching          .092       .620       .528       86.3         61.1
τ2 Matching          .102       .810       .708       90.1         67.9

γ = .95
Classical            --         9.87       9.81       94.8         76.1
Naive EB             --         .859       .739       84.6         52.4
Bias Corrected       --         1.67       1.57       95.6         82.8
Laird and Louis      .096       1.41       1.31       95.1         81.3
τ1 Matching          .081       .816       .735       91.8         75.7
τ2 Matching          .089       1.10       1.01       94.7         81.1

--   entry illegible.

TABLE 4.3:  a = 2, p = 10

                    Average    Average    Average    Average      Simul-
Interval            Lower      Upper      Interval   Individual   taneous
Method              Endpoint   Endpoint   Length     Cov. Prob.   Cov. Prob.

γ = .90
Classical            .324      18.9       18.5        90.1         35.1
Naive EB             .341       3.20       2.86       87.3         27.9
Bias Corrected       .327       3.56       3.23       90.2         37.3
Laird and Louis      .320       3.53       3.21       90.4         37.9
τ1 Matching          .294       2.77       2.48       88.6         33.3
τ2 Matching          .307       3.11       2.80       89.8         36.4

γ = .95
Classical            .262      38.2       37.9        95.0         59.3
Naive EB             .295       4.42       4.12       93.0         50.6
Bias Corrected       .283       5.23       4.94       95.3         62.4
Laird and Louis      .276       5.07       4.79       95.3         62.7
τ1 Matching          .260       3.98       3.72       94.2         58.8
τ2 Matching          .265       4.42       4.16       94.9         60.8

TABLE 4.4:  a = 5, p = 10

                    Average    Average    Average    Average      Simul-
Interval            Lower      Upper      Interval   Individual   taneous
Method              Endpoint   Endpoint   Length     Cov. Prob.   Cov. Prob.

γ = .90
Classical            --         4.86       4.78       90.0         35.5
Naive EB             --         .577       --         --           21.2
Bias Corrected       .116       .710       .594       --           41.7
Laird and Louis      .115       .714       .599       --           44.2
τ1 Matching          .103       .555       .452       88.5         40.0
τ2 Matching          .109       .633       .524       90.1         43.0

γ = .95
Classical            .068       9.82       9.76       94.9         59.1
Naive EB             .114       .701       .587       89.5         40.8
Bias Corrected       .104       .956       .853       95.3         67.0
Laird and Louis      .101       .914       .814       95.1         66.3
τ1 Matching          .091       .701       .610       93.5         62.6
τ2 Matching          .097       .811       .714       94.8         66.2

--   entry illegible.

τ2, which puts more weight on small values of a,
these intervals are shifted to the right and
now attain the nominal EB coverage level. Finally,
note that these last intervals are also
remarkably short. For example, when p = 5,
a = 2, and γ = .95, the τ2 matching intervals
attain the desired coverage on the average, yet
are only 74% as long on the average as the
Laird and Louis intervals.

5.  CONCLUSION 

In this paper we have developed two methods
for computing empirical Bayes confidence intervals
for a vector of exponential scale parameters
that take into account the uncertainty in
estimating the hyperparameter a. We have defined
and illustrated a method of bias correcting the
usual naive EB intervals, and also given a bootstrap
method by which we can match a hyperprior
Bayes solution, with associated approximations.
Our simulation study indicates that the bias
corrected naive method is a strong candidate,
and also that modifying the Laird and Louis
Type III bootstrap to approximate a different
marginal posterior can offer substantial improvement
in interval length without sacrificing
coverage probability.

1. INTRODUCTION

If observed variables X_1, X_2, and X_3 are thought
to be best represented in a regression analysis as
a sum of a true variable value (X*) and an
unobserved random measurement error value (X**);

    X = X^* + X^{**},   (1)

then a relatively simple procedure is available
for estimating the appropriate regression
coefficients. The first step involves merely
estimating three regression equations:

    X_1 = a_1 + b_{12}X_2 + b_{13}X_3 + \dots + b_{1n}X_n   (2)
    X_2 = a_2 + b_{21}X_1 + b_{23}X_3 + \dots + b_{2n}X_n   (3)
    X_3 = a_3 + b_{31}X_1 + b_{32}X_2 + \dots + b_{3n}X_n.  (4)

The a_i and b_ij are standard OLS coefficients.

Equations are next reexpressed in terms of one
of the relationships of interest. To obtain the
required set of coefficients for the relationship
between variable 1 and variables 2 and 3, we
move X_1 to the left hand side of the equals
sign in (3) and (4) and solve for X_1; that is,

    X_1 = \dots + b_{12}X_2 + b_{13}X_3 + \dots
        = \dots + B^1_{12}X_2 + B^1_{13}X_3 + \dots   (2a)

    X_1 = \dots + (-1/-b_{21})X_2 + (b_{23}/-b_{21})X_3 + \dots
        = \dots + B^2_{12}X_2 + B^2_{13}X_3 + \dots   (3a)

    X_1 = \dots + (b_{32}/-b_{31})X_2 + (-1/-b_{31})X_3 + \dots
        = \dots + B^3_{12}X_2 + B^3_{13}X_3 + \dots   (4a)

Klepper and Leamer (1984) have shown that sets of
coefficients such as B^k_12, where k (k = 1,2,3) is a
direction of minimization, are maximum likelihood
bounds for the relationship between variables 1
and 2, under the assumption that true variables
and unobserved random error variables are normally
distributed. They have demonstrated this for
multiple regressions with several independent
variables containing a random error.

In this paper we will first discuss the
intellectual heritage of the procedure. Next,
a simple method of estimating the regression
coefficient bounds is set forth. Collinearity
diagnostics, which are obtained directly as part
of this procedure, are noted. Frisch's regression
strategy is then discussed and applied to a
previously published regression analysis. Finally,
a strategy for an errors-in-variables regression
analysis, which is inherently Bayesian, is used
to reduce the bounds for the estimated
coefficients.

2. BACKGROUND

The recommended approach was discussed and
applied at length by Ragnar Frisch (1934).

Although the first editor of Econometrica,
Frisch was more of a data analyst than an
econometrician, as these terms are understood
today. He generally believed that the
assumption of error-free independent
variables required in Fisher's maximum likelihood
approach to least-squares was highly unlikely to
be encountered in applications with economic data.
Fisher's approach, however, was being increasingly
accepted by avant-garde econometricians such as
Koopmans (1937) because it supplied an elegant
formal framework for a regression analysis, and
because it lent itself well to forecasting.

Haavelmo (1944, pp. 52-55) suggested that the
Frisch approach to regression analysis was, in
fact, appropriate as a general stochastic
representation for certain types of economic
behavior, not just as a method of evaluating the
consequences of errors-in-variables in a standard
regression analysis. Malinvaud (1981) noted
that the estimation of regression coefficient
bounds recommended by Frisch received little
attention in the past because it imposed a
computational burden and, more importantly, because
it was originally developed in a non-stochastic
setting as a data analysis tool. With modern
computers and software it is no longer
computationally burdensome. Kendall and Stuart
(1979, pp. 379-380), Patefield (1981), Kalman
(1982), Klepper and Leamer (1984), and Becker
et al. (1985) have variously derived the bounds
within well-defined stochastic contexts for a
variety of regression models. Since there are
many examples of regression analyses in the
social and physical sciences in which many of
the variables are best
modelled as containing a random error, this
change in circumstances recommends increased
application of this relatively simple procedure.

3.  THE  COMPUTATIONAL  PROCEDURE 


To obtain the required coefficient bounds, the
collinearity indices, and the coordinate values
for displaying the bounds graphically, a matrix
of cofactors R* is computed for a correlation matrix
R of all the variables of interest in the analysis:

    R^* = \begin{bmatrix}
            r^*_{11} & \cdots & r^*_{1n} \\
            \vdots   &        & \vdots   \\
            r^*_{n1} & \cdots & r^*_{nn}
          \end{bmatrix}

When the elements of R* matrices are presented in
tabular form Frisch designated these matrices as
tilling tables. The regression coefficients are
calculated as ratios of elements in a tilling
table with the following formula:

    B^k_{ij} = -r^*_{kj} / r^*_{ki}   (5)

The procedure used to construct the diagnostic
graph is to move a distance equal to the
denominator value along the horizontal axis and
then to move upwards or downwards a distance
equal to the numerator value in (5), that is:

[Diagram: vertical axis running from -1 to +1; the beam
points upward if B^k_ij > 0 and downward if B^k_ij < 0.]

The number of bounds (k) is determined by the
number of variables defined by (1). These graphs
Frisch  designated  bunch  maps.  The  extreme  beams 
in  a  bunch  map  are  regression  coefficient  bounds. 
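
The tilling table and the coefficient ratios can be sketched in a few lines. This is an illustration under assumptions, not the author's program: the function names are hypothetical, indices are 0-based, and the example uses a two-variable correlation of -.5421, chosen so that the cofactors match the .5421 entry quoted later for the GD and PG tilling table. The coefficient is taken to be minus the ratio of the cofactor r*_kj to the cofactor r*_ki.

```python
import numpy as np


def tilling_table(R):
    """Matrix of cofactors of a nonsingular correlation matrix R."""
    R = np.asarray(R, dtype=float)
    # For a symmetric nonsingular matrix the cofactor matrix equals
    # det(R) times the inverse of R.
    return np.linalg.det(R) * np.linalg.inv(R)


def frisch_coefficients(R, i, j):
    """Coefficient of variable j in the relation for variable i, one value
    per minimization direction k: -r*_kj / r*_ki (0-based indices)."""
    C = tilling_table(R)
    return -C[:, j] / C[:, i]


# Assumed two-variable example (variable 0 = GD, variable 1 = PG).
R = np.array([[1.0, -0.5421],
              [-0.5421, 1.0]])
B = frisch_coefficients(R, i=0, j=1)
print(B)                 # one coefficient per minimization direction
print(B.min(), B.max())  # the extreme beams, i.e. the coefficient bounds
```

A cofactor r*_ki close to zero makes the ratio unstable, which is the near 0/0 situation the text describes when the price of oil variable is added.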

The diagonal elements in R* are the
collinearity indices. They are related to the
collinearity indices recently recommended by
Stewart (1987) as ideal. Stewart interprets these
indices within the context of the explanatory
variables in a regression equation. Since in
Frisch's scheme we are interested in the linear
connection between all variables, not just the
natural dependent variable, the collinearity
indices are interpreted within the context of all
the variables. The collinearity indices are
readily shown to be related to the familiar
multiple correlation coefficient (MCC) between
variable i and the j, ... n other variables
considered in the regression analysis, that is:

    MCC_{i \cdot j \dots n} = (1 - |R| / r^*_{ii})^{1/2}   (6)

where,

    |R| = the determinant of the matrix R

The inverse of |R|/r*_ii in (6) is the collinearity
index favored by Stewart (1987). Since |R| is
constant for any set of variables being evaluated,
the comparison of several r*_ii/|R| is equivalent
to the examination of several r*_ii. As an
indication of collinearity problems, Frisch
recommended examining the r*_ii. For a particular
correlation matrix, the greater the number of the
r*_ii that are similar in value, and the smaller
the magnitude of the r*_ii, the greater the
chance that collinearity is a serious problem in
obtaining reliable coefficients. Frisch also
recommended the examination of the ratios
r*_ii/|R|, not as collinearity indices, but as
indicators of fit. The greater the difference
in magnitude of an r*_ii and an associated |R|,
the greater the increase in the fit of a
relationship  by  adding  a  variate. 
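
The relation between the diagonal cofactors and the multiple correlation coefficient can be checked numerically. The sketch below assumes the form MCC_i = (1 - |R|/r*_ii)^(1/2), with r*_ii the i-th diagonal cofactor of R; the function name and the small correlation matrices are made up for illustration.

```python
import numpy as np


def multiple_correlations(R):
    """MCC of each variable with all the others, from |R| and the
    diagonal cofactors of the correlation matrix R."""
    R = np.asarray(R, dtype=float)
    det_R = np.linalg.det(R)
    # Diagonal cofactors r*_ii = |R| * (R^-1)_ii.
    cofactor_diag = det_R * np.diag(np.linalg.inv(R))
    return np.sqrt(1.0 - det_R / cofactor_diag)


# With two variables the MCC reduces to the absolute correlation.
R2 = np.array([[1.0, 0.6],
               [0.6, 1.0]])
print(multiple_correlations(R2))

# Three-variable check against the textbook formula r' R22^-1 r.
R3 = np.array([[1.0, 0.5, 0.3],
               [0.5, 1.0, 0.4],
               [0.3, 0.4, 1.0]])
r = R3[0, 1:]
direct = np.sqrt(r @ np.linalg.solve(R3[1:, 1:], r))
print(multiple_correlations(R3)[0], direct)
```

The two printed values in the last line agree, which is the sense in which the diagonal of the tilling table carries the fit information.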


4.  THE  PROBLEM 

In two related articles by Herbert (1986, 1987a),
a regression equation was estimated by OLS using
state level annual data. The equation was evaluated
by a battery of regression diagnostics as well as
by a test of the hypothesis that state and temporal
variance components were equal to 0. This latter
null hypothesis was not rejected.

The estimated regression equation expressed
natural gas demand per customer in a period t
(GD) as a linear combination of the price of gas
(PG), the price of electricity (PE), income (Y),
GD in the previous time period (GD(t-1)), and an
indicator of average space heating requirements
per customer (WH). The indicator WH was constructed
based on changes between years in heating degree
days and in the proportion of space heating
customers among the customers in a state. All
economic variables were expressed in constant
dollars and all variables were expressed in
logarithmic form. Except for WH, the estimated
equation was a conventional econometric formulation
of GD using state level/annual data (e.g. Beierlein
et al. (1981) and Blattenberger et al. (1983)).

Additional analysis of the economic model by
Herbert (1987b) indicated that WH represented the
space heating capital stock portion of the capital
stock surrogate variable GD(t-1), and this fact, along
with the remaining variables included in the
specification, suggested dropping GD(t-1) from the
regression equation and estimating:

    GD = a - b_2 PG + b_3 PE + b_4 Y + b_5 WH + e   (7)

where, 

e  =  any  random  error  associated  with  the 
behavioral  relationship  between  GD 
and  the  other  variables. 

The  expected  sign  for  a  coefficient  is  indicated 
by  the  designated  sign  for  the  coefficient  in  (7). 

Additional examination of descriptive statistics
of the Northeast region by Herbert (1988) and
other regression analysis by Herbert (1986, 1987a)
suggested that the major factors affecting GD
during the time period had been considered.
However, it was thought prudent to reconsider the
use of a price of oil (PO) variable in (7) because
fuel oil was widely used by households in the
Northeast during the time period and other studies
had recommended the use of this variable.
Estimated results from other studies, however, were
mixed, with statistically insignificant coefficients
frequently being estimated for the PO variable.

Additional analyses of the data by Doraan et al.
(1986), and Herbert (1987b) indicated that all
variables included in the regression equation
could probably be represented in the form of eq.
(1). All variables are either proxy variables or
they are known to be measured with error. Because
of this measurement error problem, it was decided
to evaluate the relationship between GD and PG,
PE, Y, WH, and PO using the Frisch regression
strategy.

5. THE EVALUATION OF NATURAL GAS DEMAND RELATIONSHIPS

The first series of bunch maps examined in Exhibit 1
indicates how the relationship between GD (labelled
with a 1) and PG (labelled with a 2) is affected
by the addition of the other variables in the analysis.
Exhibit 2 lists the tilling tables required to construct
the bunch maps presented in Exhibit 1.

Several of the tilling tables in Exhibit 2 will be
used to construct the bunch maps reported in
Exhibits 3 through 6. Additional tilling tables are
listed in the Data Appendix. The order of a bunch
map is read from left to right and from top to
bottom.

Exhibit 1. Bunch Maps for the Relationship between
Gas Demand (1) and Price of Gas (2)

The first bunch map in Exhibit 1 indicates that
the relationship between GD and PG is negative, as
expected, whether we minimize in the direction of
GD or PG and include only these two variables in
the analysis. The range of the coefficients,
however, is wide as indicated by the distance
between the beams. These coefficients are readily
calculable from the entries in the tilling table as
being equal to -0.5 ≈ -(.5421/1) and -1.8 ≈ -(1/.5421).

The next two bunch maps in Exhibit 1 indicate
that when we include PE (variable 3) and then Y
(variable 4) in the analysis, the relationship
is no longer identified (i.e. both positive and
negative coefficient values are obtained). This
is likely to occur when the overall fit of the
linear relationship is poor and/or when we have
not included a sufficient number of variables
to identify the relationship.

Exhibit 2. Tilling Tables Required to Construct
the Bunch Maps in Exhibit 1

Exhibit 4. Bunch Map for the Relationship between
Gas Demand (1) and Income (4).

[...] Map for the Relationship between Gas Demand (1) and
[...] Space Heating Requirements per Gas Customer (5).

The inclusion of WH (variable 5), however,
identifies the relationship between GD and PG.
The range of likely values is obtained from
entries in the first two columns of the fourth
tilling table of Exhibit 1 with a lower bound of
-.76 and an upper bound of -1.89. As a check on
the linear connection between any one variable
and the other variables for this bunch map, the
multiple correlation coefficient is readily
obtained from any entry along the diagonal of the
associated tilling table and the last entry in
the last tilling table, which represents |R|
for the correlation matrix that includes variables
1 through 5. For example,

    MCC_{GD \cdot PG \dots WH} = (1 - .0579/.3804)^{1/2} = .93

Clearly, the fit is improved by including variables
3 through 5 in the analysis. The increase in fit
is also indicated by the length of the beams.
The similarity of the diagonal elements for PE
and Y in the associated tilling table indicates
the similarity of the linear connection between
these variables and the other variables.

The final bunch map in Exhibit 1 indicates
that the relationship is no longer identified
when we add PO (variable 6). The length of the
axis had to be reduced from 1 to 0.10 for this
bunch map to be properly viewed. The greatly
reduced beam lengths when we add PO indicate that
the variability for estimating any coefficient
is greatly reduced when we include this variable
in the analysis. Several of the entries in the
tilling table used to construct these beams are
so small that the coefficient is almost of the
indeterminate form of a zero divided by a zero.
According to Frisch (1934, pp. 5-6), when the
magnitudes used to estimate the coefficients are
especially small, it is reasonable to consider
such small magnitudes to be a consequence of
randomness in the data from the unobservable
random component. One such coefficient
is a good example of this case. The coefficient
is calculated as the ratio of .0046/.00003 or
15.33, which is an unreasonably large value for
this coefficient. The diagonal elements are also
similar. This suggests that the linear connection
between any one variable and the other variables
is very similar and that collinearity is a problem
because observed variables are measured with
error. The calculation of the appropriate multiple
correlation coefficients, for which the appropriate
|R| is .005289, also reveals this similarity; they
are:


    MCC_{GD \cdot PG \dots PO} = .9596
    MCC_{PG \cdot GD \dots PO} = .9570
    MCC_{[illegible] \cdot GD \dots} = .9603
    MCC_{PO \cdot GD \dots WH} = .9504

The bunch maps for Exhibits 3 and 4 are similar
in the sense that the relationship between GD and
either Y or PE is identified when we include WH
and not identified when we include the PO variable.
Nonetheless, the range of likely values for the
relationship for PE and Y is quite large for
the bunch map that includes WH. For example, B^1_13
is equal to .17, while B^3_13 is equal to 1.99.

Exhibit  5  displays  a  different  picture  from  the 
preceding  bunch  map  exhibits.  The  relationship 
is  consistently  identified  and  the  range 


of  likely  values  is  relatively  tight,  until  we  add 
PO.  In  general,  this  relationship  is  well 
determined  as  long  as  PO  is  not  included. 

Exhibit 6 indicates that the relationship between
gas demand and the price of oil is consistently
positive, as expected for a cross price elasticity,
until we add WH. The relationship is also fairly
well-determined when we include only variables 1,
2, and 3. The relationship is less well determined
when we add variable 4.

6.  THE  BAYESIAN  TURN 

In the preceding analysis, we have used the
Frisch technique to identify the range of likely
values for a coefficient; to determine whether the
identification of a relationship was affected by
the addition of a particular variable in the
analysis; and to efficiently discover any possible
collinearity problems in the data set; but we have
not imposed knowledge of data in the estimation.

Some information, however, is available on the
relative magnitude of the measurement errors
associated with each variable. For example, we
expect the observed PO variable to be the least
accurate measure of its true variable value and
we can impose such restrictions on the other
variables. Moreover, information about measurement
errors can be used to bound or reduce the range of
likely values for a coefficient by the method
proposed by Klepper and Leamer (1984) and by
Klepper (1987).

Rather than update a prior distribution for a
regression coefficient based on sample information,
the procedure indicates how the sample information
would map different priors into posteriors. In
order to implement the procedure, judgements must
be formed either about the maximum value of R²
associated with (7) when all variables are correctly
measured and/or about measurement error variances
of the explanatory variables. For this application
the proportion of the observed variance in a
measured variable due to measurement error is
specified. The necessary assumptions and the
nature of the measurement error in the observed
variable values considered here are discussed in
further detail in Herbert (1987c). Based on the
previous analyses we specify the error variance
(VAR) as a proportion of each observed variable
variance to be:


    VAR_{GD**}/VAR_{GD} = 0.02   (8)
    VAR_{PG**}/VAR_{PG} = 0.02   (9)
    VAR_{PE**}/VAR_{PE} = 0.03   (10)
    VAR_{Y**}/VAR_{Y}   = 0.03   (11)
    VAR_{WH**}/VAR_{WH} = 0.04   (12)

The  initial  estimated  relationships  between 
GD  and  the  other  variables  are  obtained  by  first 
minimizing,  as  in  (2)  -  (4),  and  then  normalizing, 
as  in  (2a)  -  (4a).  These  estimates  are  reported 
in  Exhibit  7.  The  final  estimates,  after  we 
impose  the  measurement  error  variance  constraints, 
as in (8) thru (12), are also reported.

EXHIBIT 7. Initial and Final Estimates

                        Minimization Direction
Coefficients       GD       PG       PE       Y        WH

Initial
  PG              -.76    -1.15    -1.89    -1.25    -1.15
  PE               .14      .35     1.99      .80      .87
  Y                .23      .38      .30     1.25      .35
  WH               .83      .97     1.37     1.27      .91

Final
  PG              -.83    -1.04    -1.36    -1.08     -.90
  PE               .18      .30     1.03      .21      .24
  Y                .28      .36      .33      .79      .35
  WH               .91      .99     1.17     1.14     1.06

Elasticities
  PG              -.56     -.76     -.92     -.73     -.61
  PE               .21      .34     1.18      .24      .28
  Y                .15      .19      .17      .41      .18
  WH               .70      .76      .90      .89      .82

These final coefficients are also expressed as elasticities,
which are commonly used measures in economic
analysis. The elasticities are obtained by multiplying
the coefficients by the ratio of the appropriate
standard errors, that is:

    SB^k_{ij} = B^k_{ij} (S_i/S_j) = -(r^*_{kj}/S_j) / (r^*_{ki}/S_i)   (13)

This  type  of  transformation  is  used  to  obtain  the 
coefficient  that  would  have  been  obtained  if  the 
sample  variance/covariance  matrix  rather  than  the 
correlation  matrix  were  used  in  the  estimation  of 
coefficients  as  is  ordinarily  done  in  most  regression 
analyses. 
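
The rescaling in (13) can be verified on simulated data. In the sketch below the function name, the simulated data and the reading of (13) as -(r*_kj/S_j)/(r*_ki/S_i) are assumptions, and S_i is taken to be the sample standard deviation of variable i (assumed to be what the text calls the standard error). When the minimization direction k equals i, the rescaled coefficient should coincide with the ordinary least squares slope computed from the raw data.

```python
import numpy as np


def rescaled_coefficient(R, s, i, j, k):
    """-(r*_kj / S_j) / (r*_ki / S_i): the correlation-based coefficient
    rescaled to the units of the variance/covariance matrix (0-based)."""
    R = np.asarray(R, dtype=float)
    C = np.linalg.det(R) * np.linalg.inv(R)  # cofactors of symmetric R
    return -(C[k, j] / s[j]) / (C[k, i] / s[i])


# Made-up correlated data with unequal scales.
rng = np.random.default_rng(0)
mix = np.array([[1.0, 0.5, 0.2],
                [0.0, 2.0, 0.3],
                [0.0, 0.0, 0.5]])
X = rng.normal(size=(200, 3)) @ mix

R = np.corrcoef(X, rowvar=False)
s = X.std(axis=0, ddof=1)

# Direction k = 0 (minimize in the direction of variable 0).
b = rescaled_coefficient(R, s, i=0, j=1, k=0)

# OLS of column 0 on columns 1 and 2, with an intercept.
A = np.column_stack([np.ones(len(X)), X[:, 1:]])
ols = np.linalg.lstsq(A, X[:, 0], rcond=None)[0]
print(b, ols[1])  # the two slopes agree up to rounding
```

For the other minimization directions there is no single OLS fit of variable 0 to compare against; those values are the normalized coefficients from the regressions run in the other directions.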

A  comparison  of  the  entries  in  Exhibit  7  for  the 
initial  estimation  and  the  final  estimation  indicates 
that  the  range  of  likely  values  in  the  final  estimation 
is  much  reduced  from  the  range  of  likely  values  in  the 
original  estimation.  In  particular  the  range  of 
values  for  the  PE  coefficient  is  much  narrower. 
Nonetheless,  the  upper  bounds  of  1.18  for  the  PE 
elasticity  and  of  .41  for  Y  appear  high  relative 
to  the  other  values  for  these  coefficients.  The 
elasticity  for  PE  also  seems  high  from  a  subject 
matter  point  of  view. 

The key coefficients in the analyses of gas demand
are the PG coefficient, designated an own-price
elasticity when it is expressed as an elasticity,
and the WH coefficient, which represents space
heating capital equipment stock effects. Fortunately,
the WH and PG coefficients are more stable across
equations than the other coefficients.

7.  SUMMARY  AND  CONCLUSIONS 

In  this  paper  we  have  demonstrated  the  instructive 
value  of  the  Frisch  regression  strategy  as  a  data 
analysis  tool  when  errors-in-variables  are  suspected 
in  a  regression  analysis.  The  Frisch  strategy  was 
used  to  discover  any  identification  problems  for 
the  relationship  between  GD  and  the  other  variables. 
The  WH  variable  was  found  to  be  important  and  the 
PO  variable  was  found  to  be  detrimental  in  the 
identification  process.  The  instability  of  the  Y 
and  PE  coefficients  was  clearly  observable  from  a 
casual examination of the bunch maps. The analysis
also suggests that information about the extent of
the measurement error in Y and PE might be
used to reduce the range of likely values, that
is, to increase the precision, of the reported
coefficients. For example, we would not have to
minimize in the direction of PE if it were found
that the PE variable accurately represented the
price paid by natural gas customers for electricity.

This would greatly reduce the reported bounds on the
elasticity, as indicated by reconstructing Exhibit 7
without the PE column. Frisch's recommended
procedures were also used to identify any collinearity
problems within the data set. This application
does not circumscribe the usefulness of the Frisch
strategy as a data analysis tool. For example,
Malinvaud (1966, pp. 32-36) used the method as a
means of identifying the number of linear relationships
in a data set and then estimating these separate
relationships rather than one relationship.
Frisch (1934) and Stone (1952), throughout their
texts, used the Frisch strategy as a visual documentation
of a regression analysis.

Finally,  we  have  shown  how  newer  techniques  can 
help  in  the  identification  process.  With  these 
techniques,  information  about  the  measurement  error 
in  observed  variables  was  used  to  reduce  the  range 
of  likely  values  for  estimated  coefficients.  This 
newer  approach  further  underlines  the  importance 
of  information  on  the  measurement  error  for  an 
estimation.
DATA APPENDIX

This appendix contains some additional statistics
which are useful in considering the results presented
in the main body of the text. Listed below for
the convenience of some readers are the standard estimated
results (standard errors are presented in parentheses)
for the initial equation estimated by OLS.

The tilling tables required to compute the coefficients
in Exhibits 3 through 6, which were not presented in
Table 2, are presented in Exhibit 8. Additional tilling
tables relating to the controversial PO variable are
reported in Exhibit 9. The standard deviations of the
variables used in this analysis, which are required for
the calculation of the elasticities, are: GD=.16097,
PG=.23739, PE=.14041, Y=.08394, WH=.20928. The average
values of the variables used in this analysis are:
GD=4.55940, PG=.43313, PE=1.94563, Y=1.27398, WH=9.07288.

Exhibit 9. Tilling Tables for Examining the Relationship
Between Price of Oil (PO) and Other Variables
Abstract

Analysis of covariance under conditions of
small covariate-criterion correlations is
examined. In the case of correlational research
the increase in precision of an F-test assumed
by the addition of covariates is questioned. To
test whether the increase in precision assumed
was offset by an increase in bias, a set of
simulations was conducted. The first simulation
showed the degree to which non-zero
non-significant correlations between covariate
and criterion changed the tail probability of
the F-test. The second simulation included all
"significant" covariates from a set of random
normal variates, showing how selecting all
significant covariates without controlling for
the number of covariates considered can affect
the F-test.

Introduction 

Researchers  faced  with  results  of  an  analysis 
of  variance  (ANOVA)  often  wonder  how  the 
analysis  would  have  changed  had  study 
participants  been  equivalent  on  background 
variables. Attempts to control for effects of
these background or nuisance variables in such
uncontrolled studies have led to widespread use
of analysis of covariance (ANCOVA). While most
scientists agree this procedure is a poor
substitute for true experimental control, many
feel that certain research designs (Cohen &
Cohen, 1975), in particular the non-equivalent
group designs often used in the situations
described above, require the use of a covariate
or a set of covariates, especially in situations
in which linear univariate relationships between
background variables are explanations for any
obtained group differences on a criterion
variable of interest. In a controlled study the
possible confound represented by a specific
background variable is often randomized out of
the design; however, in the uncontrolled study,
some other procedure, most often ANCOVA, is
suggested to adjust for inequalities on that
background variable.

While some conditions under which ANCOVA may
be troublesome have been addressed (Elashoff,
1969; Glass, Peckham, & Sanders, 1972), one
problem under-represented in the literature
regards the use of a variable or a set of
variables as covariates when those variables
have relatively low correlations with the
criterion variable under study. This paper
shows the degree to which such low correlations can
bias the conclusions based on tests of effects
of ANCOVA. This bias is due to assumptions
underlying ANCOVA. Simulated and real data
examples are presented. Finally, a Monte Carlo
study showing the effects of selecting and
controlling for significant covariates generated
from a set of random variates will be presented.

As in most real data analyses, the correlation
between covariates and independent variables is
not controlled.

On  the  Use  of  ANCOVA 

In the analysis of variance, any differences
on the criterion variable are assumed to be due
to membership in the groups defined by the
independent grouping or treatment variable. Let
$E(Y_1)$ and $E(Y_2)$ be the expectations or means for
Groups 1 and 2, respectively; then

$$E(Y_1) - E(Y_2) = \alpha. \tag{1}$$

If the difference is a function of only group or
treatment effects, the estimate of $\alpha$ is
unbiased. If, on the other hand, the difference
in the observed values on the criterion may
result from contributions of other sources in
addition to the independent variable, Equation 1
becomes

$$E(Y_1) - E(Y_2) = \alpha + f(X_1, X_2, X_3, \ldots, X_m) \tag{2}$$

or, expressed as a conditional expectation,

$$E(Y_1 \mid X_1, \ldots, X_m) - E(Y_2 \mid X_1, \ldots, X_m) = \alpha \tag{3}$$

where $f$ is a function of some set of variables
$X_i$ that contribute to the observed group
differences. This function, then, represents
the degree of bias in the analysis of variance
when sources of systematic criterion variable
differences other than the planned independent
variable exist.

When a single variable $X$ can be located that
provides an unbiased estimator of $\alpha$, Equation 2
can be written in terms of conditional
expectations as

$$E(Y_1 \mid X) - E(Y_2 \mid X) = \alpha \tag{4}$$

This covariate, $X$, then, allows one to adjust
the analysis for this additional source of
variation. Analysis of covariance was
originally developed to increase precision in
randomized experiments by adjusting for effects
of additional variables not involved in the
assignment of individuals to treatment or
control groups (Fisher, 1932). Adjustments made
using the covariate are only expected to yield
unbiased estimates of effects (group
differences) in the case of random assignment.

Compared with ANOVA, ANCOVA is assumed to
provide a better, though possibly biased,
estimate of $\alpha$ when a covariate can be isolated
that is confounded with the treatment. With
such a confound ANCOVA provides a more precise
error term (Cochran, 1957). The benefit due to
increased precision may be offset by bias
introduced into the analyses by addition of a
covariate. Although the degree of bias in
uncontrolled studies is often unknown (Weisberg,
1979), analysis of covariance is suggested as an
appropriate data analytic strategy when sources
of variation are located that are related to the
dependent or criterion variable of interest but
are unrelated to the independent grouping
variables representing treatment or individual
difference effects.

The size of relationship between any
covariate and the criterion necessary to
minimize bias and maximize precision has
normally not been specified, although some
suggestions have appeared. Cox (1957) suggested
that when both covariate and criterion are
assumed to be drawn from a bivariate normal
distribution, $\rho > .60$ could be used as a cutoff
above which ANCOVA is preferable to blocking. Maxwell,
Delaney, and Dill (1984) argued that the size of
the correlation is generally not important in
deciding between blocking and use of a
continuous covariate. In designs such as the
non-equivalent post-test design in which the
covariate is used as a proxy pretest, only a
perfect correlation between the covariate and
criterion can assure that the use of a covariate
as a control would not introduce bias (Cook &
Campbell, 1979). These and similar suggestions
(Myers, 1979) make one wonder what happens in
the case when $\rho$ tends toward zero. In this
vein, the low correlation will be discussed in
the context of assumptions of ANCOVA.

As  often  discussed  (Cochran,  1957;  Myers, 
1979),  ANCOVA  is  designed  for  the  analysis  of  an 
experiment  in  which  a  set  of  nT  individuals  have 
been  selected  at  random  and  assigned  at  random 
to  T  treatment  conditions.  The  complication 
arises  when  another  variable,  X,  is  shown  to  be 
correlated  with  the  criterion  variable  measuring 
the  way  in  which  manipulated  groups  are  expected 
to  differ.  The  question  then  arises:  Did  the 
groups  really  differ  on  the  criterion  because  of 
the  manipulation,  or  can  that  difference  be 
explained  by  the  covariate?  The  way  usually 
recommended  to  answer  the  questions  involves 
adjustment  of  scores  of  the  dependent  variable 
by  extracting  the  effect  of  the  covariate  from 
the  criterion  and  essentially  analyzing  residual 
scores.  If  a  difference  is  still  obtained,  the 
treatment  groups  are  more  likely  to  actually 
differ  on  the  criterion  variable. 

In order to properly implement the ANCOVA, a
set of assumptions is required to assure that
tests of differences in mean criterion scores
are unbiased (Elashoff, 1969).

The model for the simple one-way fixed
effects analysis of covariance is

$$Y_i = \mu + \alpha_j + \beta(X_i - \bar{X}) + e_i \tag{5}$$

in which $\mu$ is the mean of the criterion variable
across individuals and treatments, $\alpha_j$ is the
deviation from the mean due to the effect of the
treatment, $\beta(X_i - \bar{X})$ is the variability
accounted for by the covariate expressed in
terms of the regression slope, $\beta$, of $Y_i$ onto
$(X_i - \bar{X})$, and $e_i$ is an unexplained error in the
individual's residual score. The ANCOVA model
is an extension of the ANOVA model and is
subject to the same assumptions with

$$e_i \sim \mathrm{NID}(0, \sigma_e^2) \quad \text{for } i = 1, \ldots, nT \tag{6}$$

$$\sum_j \alpha_j = 0 \quad \text{for } j = 1, \ldots, k \tag{7}$$

when individuals are randomly assigned to the
treatment conditions as required by the ANOVA.

Additional assumptions that are needed when a
covariate is included in the analysis appear in
Elashoff (1969). Violation of any ANCOVA
assumptions can introduce bias into tests of
effects. In particular, use of covariates that
correlate poorly with the criterion exacerbates
the degree of bias as follows. As Cochran
(1957) and Elashoff (1969) have shown, the
increase in precision of the F-test for main
effects can be expressed in terms of the
decrease in error variability due to the
addition of a covariate. This decrease can be
expressed as a function of the correlation
between the criterion and the covariate. The
decreased error variability is

$$\sigma_{e.X}^2 = \sigma_e^2 (1 - \rho^2)\left(1 + \frac{1}{f_e - 2}\right) \tag{8}$$

where $\sigma_{e.X}^2$ is the error variability with the
covariate added to the model, $\sigma_e^2$ is the error
variability as obtained in the ANOVA model, $\rho$ is
the correlation between the criterion and the
covariate, and $f_e$ is the degrees of freedom
associated with the error term. This equation can
be rearranged to yield a formula for the
proportionate reduction in error variability due
to addition of a covariate. The equation
becomes
$$\Delta\sigma_e^2 / \sigma_e^2 = (\sigma_e^2 - \sigma_{e.X}^2) / \sigma_e^2 = [\rho^2 (f_e - 1) - 1] / (f_e - 2) \tag{9}$$


The reduction results in an unbiased parameter
estimate only when the correlation between
covariate and grouping variables is zero. It is
often the case that even though the population
coefficient may be zero, the sample coefficient
deviates slightly from zero. In that case, the
independence of the covariate as a predictor of
the criterion is represented by the part
correlation reflected in the regression weight
of the covariate. In this situation the independent
variable and criterion compete for the covariate
variance. Since the redundant variance has been
removed from the error term by inclusion of the
independent variable, the proportionate
reduction in error variability decreases as the
assumption of independence is violated.

The change in error variability can be
estimated. For n = 100 and a correlation of .20
between the criterion and covariate (a
correlation just above the .05 level of
significance), one reduces the error variance by
about 3%. With a sample of 100 observations, it
is conceivable that the correlation of .20
reflects a chance relationship. If this is the
case, adjustment to the analysis of variance is
performed for a variable that is not a member of
the set $\{X_1, \ldots, X_m\}$ for the function in
Equation 2. This adjustment thus is tantamount
to arbitrarily pulling error variation out of the
denominator of the F-test.
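
The 3% figure can be checked directly from Equations 8 and 9. The sketch below is only an illustration: the function names are hypothetical, and the error degrees of freedom f_e = 97 (100 observations, two groups, one covariate) is an assumption, since the text does not state which value was used.

```python
def error_variance_ratio(rho, f_e):
    """Equation 8 divided by sigma_e^2: adjusted over unadjusted error variance."""
    return (1.0 - rho ** 2) * (1.0 + 1.0 / (f_e - 2))


def proportionate_reduction(rho, f_e):
    """Equation 9: proportionate reduction in error variability."""
    return (rho ** 2 * (f_e - 1) - 1.0) / (f_e - 2)


# Assumed error degrees of freedom: n = 100, two groups, one covariate.
f_e = 97
for rho in (0.05, 0.10, 0.20):
    print(rho, round(proportionate_reduction(rho, f_e), 4))
```

For a correlation of .20 the function returns roughly .03, in line with the figure quoted above. Equation 9 also turns negative whenever $\rho^2 < 1/(f_e - 1)$, so a very weak covariate costs more through the lost degree of freedom than it removes from the error term.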

Weisberg (1979) has demonstrated for the
uncontrolled study in the case of the linear
model that the proportion of bias remaining
after the ANCOVA adjustment is a function of
three correlations: (1) $\rho_{ZQ}$, the correlation
between group membership and a criterion
variable calculated under the imaginary
condition that all individuals were assigned to
the control condition (or in the case of an
individual difference variable, to the same
value of the variable); (2) $\rho_{XQ}$, the correlation
between that same criterion measure and the
covariate; and (3) the correlation between
the covariate and group membership. The
proportion of bias remaining after adjustment
(Weisberg, 1979) can be expressed as

$$\rho_{ZQ} (1 - \rho_{XQ}^2)$$

In practice, only the correlation between the
covariate and group membership can be estimated from the
data. However, it can be shown that the
covariate adjustment is unbiased only for the
cases in which $\rho_{XQ} = 1$. For our purposes, this
equation serves to demonstrate that even in
situations in which one is able to determine
precision in the analysis, one can at best only
estimate bias for the uncontrolled study. The
next section presents data which indicate the degree
of that bias.

Bias Due to Low Covariate-Criterion Correlations
The following discussion concerns the
situation in which a non-zero but
non-significant correlation between the
covariate and the criterion exists. The cause
of such a correlation could be, in part, due to
the lack of reliability of the covariate.
However, the situation with which this paper is
concerned is that in which a variable assumed to
be a covariate has a true correlation of zero with
the criterion variable.

Two  reports  of  simulation  studies  will  be 
presented.  The  first  will  concentrate  on  one 
covariate  in  the  context  of  a  one-factorial 
design.  The  second  will  present  two  or  more 
covariates  in  a  two-factorial  design. 

Study I: Method and Results
To show the degree of bias introduced when
controlling for a statistically non-significant
relationship, a simulation study and analysis of
empirical data were conducted in which a set of
one-way ANOVAs on a criterion variable were
performed. The grouping variable used had only
two levels. The criterion variable was created
by generating a random normal variate and
assigning a grouping number (either 1 or 2) to
each value of the variate. A constant was then
added to the second group, creating the group
difference. The size of the constant was
initialized at d = .02 and incremented by .005
until the difference became statistically
significant at the p < .001 level. Covariates
were selected by generating a set of random
variates (also selected from a normal
distribution) using the RANNOR function in SAS
and correlating each variate with the criterion
measure. The selected covariates had
correlations with the criterion ranging from
r = .01 up to a level of correlation
representing a relationship just under the
alpha = .05 level of significance. Covariates
were assumed to be homogeneous across and
independent of groups. Models representing each
level of difference were then re-estimated
including the covariate. The simulation was
performed for n = 100. The results of the
analysis appear in Table 1.
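
The procedure described above can be sketched in a few lines. This is a minimal illustration, not the authors' SAS program: it uses Python's `random.gauss` in place of RANNOR, the function names and the criterion standard deviation `sd` are hypothetical, and it reports F statistics rather than tail probabilities because the Python standard library has no F distribution.

```python
import random


def sum_products(a, b):
    """Sum of cross-products of deviations of a and b from their means."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    return sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b))


def within_products(a, b, g):
    """Pooled within-group sum of products for the group labels in g."""
    total = 0.0
    for label in set(g):
        a_g = [ai for ai, gi in zip(a, g) if gi == label]
        b_g = [bi for bi, gi in zip(b, g) if gi == label]
        total += sum_products(a_g, b_g)
    return total


def anova_f(y, g):
    """F statistic for a two-group one-way ANOVA (1 and n - 2 df)."""
    n = len(y)
    ss_error = within_products(y, y, g)
    ss_group = sum_products(y, y) - ss_error
    return ss_group / (ss_error / (n - 2))


def ancova_f(y, x, g):
    """F statistic for the group effect adjusted for covariate x (1 and n - 3 df)."""
    n = len(y)
    adj_total = sum_products(y, y) - sum_products(x, y) ** 2 / sum_products(x, x)
    adj_error = (within_products(y, y, g)
                 - within_products(x, y, g) ** 2 / within_products(x, x, g))
    return (adj_total - adj_error) / (adj_error / (n - 3))


def simulate(n=100, d=0.05, sd=1.0, seed=1):
    """Criterion with constant d added to group 2, plus an unrelated covariate."""
    rng = random.Random(seed)
    g = [1 if i < n // 2 else 2 for i in range(n)]
    y = [rng.gauss(0.0, sd) + (d if gi == 2 else 0.0) for gi in g]
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]  # true correlation with y is zero
    return y, x, g


y, x, g = simulate()
print("ANOVA F: ", anova_f(y, g))
print("ANCOVA F:", ancova_f(y, x, g))
```

Rerunning `simulate` with different seeds and values of `d` gives pairs of F statistics comparable in spirit to the columns of Table 1; the covariate is pure noise, so any difference between the two statistics reflects only sampling fluctuation.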

The table shows the patterns of the tail
probabilities of the significance tests of the
estimated model. The change in tail
probabilities from the ANOVA F-test to the
ANCOVA F-test is a function of the correlation
between criterion and covariate, as one would
expect. For the case in which the correlation
is near zero, and the adjustment is small
compared to the sampling fluctuation, the change
in tail probabilities tends to oscillate about
the nominal alpha level. However, within each
level of difference a
sufficiently high correlation causes adjustments
in the direction of a more liberal test. The
size of the correlation sufficient to effect a
change is surprisingly small. For n = 100 and a
mean difference of .050 between the two
simulation groups, the tail probability of the
F-test is .0721. A correlation between the
covariate and criterion of r = .17 (p = .0967)
changes the tail probability to .0463, which
becomes statistically significant if one uses
the p < .05 cutoff.

Discussion

The degree of bias introduced into an
analysis-of-variance F-test by the inclusion of
a covariate that is weakly correlated with the
criterion variable depends on both the size of
the correlation and the size of the sample. The
degree of the bias can roughly be estimated by
calculating the reduction in error variance in
Equation 9 using the largest possible
nonsignificant correlation between the covariate
and the criterion. Using this equation as an
estimate assumes that the reduction in variance
identified as the increase in precision
functions as a measure of bias. This is the
case only when the covariate-criterion
relationship can be attributed only to sampling
fluctuation. The results of this study suggest
that when the correlation is non-zero and
nonsignificant the tail probability of the test
will most often be underestimated. In the case
of a marginally non-significant test, bias (even
with an n as large as 100) can be enough to push
the probability value into the significant area
of the sampling distribution.

The decision whether to include a covariate
as a statistical control in an analysis should
be based on a number of criteria, the most
important of which is the degree to which the
adjustment represented by the covariate is
theoretically meaningful. Once that criterion
is met, one should determine an absolute minimum
value of the criterion-covariate relationship.
The minimum value will be a function of the size
of the correlation, the sample size, and the
size of the effect. It is suggested here that
an absolute minimum be determined by requiring
the correlation to be above a level of
significance determined a priori for the
study and adjusted for the number of covariates
considered.

While the models discussed above included
only one covariate, one can imagine the
situation with more than one covariate.
Assuming two uncorrelated covariates, each would
tend to decrease the error term by some amount
that would tend to decrease the size of the
tail probability even more. One could
essentially control the level of significance by
introducing enough weak covariates into any
analysis.

To demonstrate this, we ran the following
simulation study to show the effect of selecting
Table 1

The Effect of a Random Covariate on a One-Way ANOVA (n = 100)
(Tail Probabilities of the F Statistic)

                                      Mean Differences Between Groups
Correlation         .020   .025   .030   .035   .040   .045   .050   .055   .060   .065   .070   .075   .080
No Covariate       .8281  .6291  .4543  .3112  .2019  .1240  .0721  .0397  .0207  .0102  .0048  .0022  .0009
.01 (p = .9358)    .8292  .6311  .4566  .3138  .2043  .1260  .0736  .0407  .0214  .0106  .0050  .0038  .0010
.03 (p = .7538)    .8137  .6174  .4455  .3051  .1981  .1218  .0710  .0392  .0205  .0102  .0048  .0022  .0009
.05 (p = .6214)    .7716  .5817  .4175  .2846  .1842  .1130  .0658  .0364  .0191  .0095  .0045  .0020  .0009
.07 (p = .4730)    .7644  .5737  .4094  .2773  .1781  .1084  .0625  .0342  .0177  .0087  .0041  .0018  .0008
.09 (p = .3654)    .8333  .6341  .4586  .3147  .2046  .1259  .0734  .0405  .0212  .0105  .0049  .0022  .0010
.11 (p = .2612)    .7048  .5212  .3662  .2442  .1543  .0924  .0524  .0282  .0144  .0070  .0032  .0014  .0006
.13 (p = .1620)    .7443  .5537  .3911  .2619  .1660  .0996  .0565  .0304  .0155  .0075  .0034  .0015  .0006
.15 (p = .1233)    .7575  .5648  .3997  .2681  .1701  .1021  .0580  .0312  .0159  .0077  .0035  .0015  .0006
.17 (p = .0967)    .6781  .4960  .3442  .2263  .1409  .0830  .0463  .0244  .0122  .0058  .0026  .0011  .0005
.19 (p = .0572)    .6882  .5035  .3493  .2294  .1425  .0837  .0465  .0245  .0122  .0058  .0026  .0011  .0005

Table 2

Analyses in Which the "Nonsignificant Covariates" are Retained

                         Null Hypothesis     Alternate Hypothesis
                         .05      (.01)      .05      (.01)
Overall Model            14%      (4%)       100%     (96%)
Independent Variable A   25%      (8%)       99%      (99%)
Independent Variable B   17%      (9%)       98%      (96%)
a set of "significant" covariates generated from
a set of random variates.

Study  II:  Method  and  Results 

Monte Carlo data for a series of 100 double
classification analyses of covariance were
generated using random data. For these
simulations, care was taken to approximate the
sample sizes and effect sizes found in
developmental psychology research. In the
analysis of covariance, two main effects were
defined with three levels each, and 16 normal
covariates. Each cell of the analysis contained
12 observations, making a total sample size of
108 for each analysis. All data were generated
with SAS random number functions for generating
normal variates (function RANNOR). For the
first series of 100 "experiments" no main effect
or interaction was defined in the dependent
variable.

Analyses  Based  on  Data  Where  the  Null  Hypothesis 
is  True 

For the first set of Monte Carlo data, where
the null hypothesis was true, an analysis of
covariance was conducted to estimate the effect
of false inclusion of covariates on the nominal
alpha level of the experiment. Any covariate
which achieved statistical significance at the
.05 level was left in the model for the data.
This situation would correspond to that arising
when the researcher falsely concludes that a
covariate is statistically significant when it
is, in fact, not. Comparison of the probability
levels associated with the overall model, and
with the effects of the two independent
variables, forms the basis for comparing the
nominal error rates of the model with the true
error rates. For probability levels associated
with the overall model, 14% of the experiments demonstrated
significance at a nominal .05 alpha level. Four
percent of the experiments demonstrated
significance at a nominal alpha level of .01.
For the independent
variables, 25% and 17% of the experiments showed
significance at a nominal alpha level of .05 for the first
and second independent variables, respectively.
The proportions of experiments showing significance at a nominal
alpha of .01 were 8% and 9%, respectively.

Discussion

As  can  be  seen,  selection  of  those 
"covariates"  that  appear  to  be  significant  and 
adjustment  for  those  variables  can  dramatically 
change  the  tail  probabilities  of  the  F-test.  In 
practice  such  an  error  of  inclusion  can  be 
avoided  by  carefully  controlling  the 
experiment-wise  alpha  level  used  to  define  a 
significant  correlation.  However,  researchers 
looking  at  single  covariates  that  account  for 
what  appears  to  be  a  small  part  of  the  variation 
in  a  criterion  variable  are  often  reluctant  to 
give  up  that  adjustment.  By  keeping  such 
variables  as  covariates,  they  must  realize  that 
ANCOVA  adjusts  for  anything  that  is  supplied  as 
a covariate in the analysis. ANCOVA will indeed
attempt to adjust even when the
covariate-criterion relationship is no greater
than chance. Since the proportion of bias
remaining (Equation 11) after the adjustment is
inversely proportional to the size of the
covariate-criterion correlation, care must be
taken in the selection of variables that act as
covariates.
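
The scale of the selection problem in Study II can be illustrated with a short calculation. The sketch below assumes the 16 covariate tests are independent, which the study does not state, and uses a Bonferroni division as one common way of controlling the experiment-wise alpha level; the function names are hypothetical and the authors do not prescribe a particular correction.

```python
def familywise_error(alpha, m):
    """Chance of at least one false 'significant' covariate among m independent tests."""
    return 1.0 - (1.0 - alpha) ** m


def bonferroni_alpha(alpha, m):
    """Per-covariate alpha that keeps the experiment-wise level near alpha."""
    return alpha / m


m = 16  # number of random covariates screened in Study II
print("Expected false inclusions:", m * 0.05)
print("P(at least one inclusion):", familywise_error(0.05, m))
print("Bonferroni per-test alpha:", bonferroni_alpha(0.05, m))
```

Under these assumptions more than half of the null experiments would retain at least one purely random covariate when each is screened at the .05 level.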

It is ultimately the choice of the
investigator to determine which variables might
theoretically serve as covariates. However, as
Weisberg  (1979)  pointed  out,  inclusion  is  only 
appropriate  if  one  can  assume  that  individuals 
who  have  the  same  value  on  a  covariate,  and  are 
members  of  different  groups,  would  have  the  same 
value  on  the  criterion  in  the  absence  of  a  group 
effect.  We  add  that  even  when  this  assumption 
is  plausible,  one  must  be  assured  that  the  size 
of  the  covariate-criterion  relationship 
signifies  a  truly  non-zero  and  theoretically 
meaningful  relationship. 
Abstract

With recent advances in computer processing
speed, statistical packages with generalized maximum
likelihood estimation subroutines are proliferating.
Unfortunately, convergence criteria in
these packages are based on the step-wise change
of the parameter estimates or on the closeness of
the first derivative vector to 0. No measure of
the adequacy of the asymptotic parameter variance
matrix exists and most statisticians are unaware
that the variance matrix approximation
based on the commonly-used quasi-Newton iterative
methods can be poor. We examine the
behavior of this approximation for two representative
likelihoods and suggest an additional convergence
criterion that may help the user to determine
when the variance matrix as well as the
parameter vector are sufficiently close to their
true values.

1.  Introduction 

Maximum likelihood parameter estimation requires an
iterative solution for all but the simplest statistical
distributions for which analytic solutions are available. The
most popular method of solution has been Newton's
method, which converges quickly and automatically
provides the most accurate estimate available of the
asymptotic variance matrix for the parameter estimates.
Unfortunately, this method requires the user to provide the
second derivatives of the likelihood function (the
Hessian matrix), a task which is often very difficult. The
quasi-Newton class of unconstrained optimization
algorithms avoids this problem by approximating the
inverse Hessian matrix, and has been shown to converge
superlinearly to the correct parameter vector. Use of a
quasi-Newton algorithm and an accurate first derivative
approximation algorithm coupled with the recent
dramatic improvements in microcomputer processing speed
has made possible generalized computer programs for
maximum likelihood estimation. It is no longer
necessary for the statistician to write a new FORTRAN
program for each different form of the likelihood to be
maximized, thus simplifying examination of competing
models.

Because  the  quasi-Newton  methods  were  developed 
for  the  solution  of  deterministic  models,  primarily  in 
operations  research,  little  work  has  been  done  to  exam¬ 
ine  the  accurawry  of  the  approximation  for  the  variance 

‘current  addreaa:  Vincent  T.  Lombardi  Cancer  Reaearch  Cen¬ 
ter,  Georgetown  Univeraity,  Washington,  DC  20007 


matrix  or  its  rate  of  convergence.  We  have  examined 
the  behavior  of  this  matrix  approximation  for  several 
representative  likelihoods.  Comparison  of  known  ana¬ 
lytic  results  to  results  from  a  quasi-Newton  procedure 
using  an  optimal  step  size  suggests  that  after  the  first 
few iterations the variance matrix approximation converges to its correct values at nearly the same rate as
the  parameter  vector  itself.  We  propose  a  method  of 
determining  when  the  matrix  has  converged  sufficiently 
to  a  solution. 

2. Maximum Likelihood Estimation

Let $Z_r$ denote the data vector for observation $r$ and $X = (x_1, x_2, \ldots, x_J)'$ denote the parameter vector to be estimated. Then

$$L(Z;X) = \prod_{r} f(Z_r;X), \qquad (1)$$

is the likelihood function to be maximized over possible values of $X$, where $f(Z_r;X)$ is the probability density function for X. Maximum likelihood estimation may be viewed as a general unconstrained optimization problem, requiring solution of any of the following equivalent problems:

$$\max L(Z;X) \quad \text{or} \quad \min[-L(Z;X)] \quad \text{or} \quad \min[-\ln L(Z;X)]$$

For convenience, we will solve the last problem. Let $G(Z;X) = -\ln L(Z;X)$, the objective function to be minimized. If we may assume that:

1. $G(Z;X)$ is twice continuously differentiable,

2. $X^*$ is the unique solution to the problem; $X^*$ is called the maximum likelihood estimate of $X$,

3. $G(Z;X)$ is strictly convex in a neighborhood about $X^*$ (i.e., $L(Z;X)$ is strictly concave),

then 


the asymptotic variance matrix of the parameter vector $X$. In practice, we estimate this variance matrix by $H^*$, the inverse Hessian matrix evaluated at the maximum likelihood estimates of the parameters ($X^*$).

3.  Iterative  Methods 

For all but the simplest models, analytic solutions to the maximum likelihood optimization problem do not exist. Iterative methods of solution generally include the following steps:

1. determine initial estimates for $X$, denoted $X_0$

2. compute $X_{i+1} = X_i + S_i t_i$,

where $X_i = (x_{i1}, x_{i2}, \ldots, x_{iJ})'$ is the parameter vector estimate at iteration $i$; $S_i$ is the direction vector for this step and $t_i$ is the step size which minimizes $G$ along the ray $S_i$.

3. repeat step 2 until convergence or the maximum number of iterations is reached.

For the Newton method of solution,


Calculation of the optimal step size $t_i$ is required to ensure a significant improvement at each step (Chambers 1977). However, most implementations of Newton's algorithm use a constant step size of 1, which is optimal for a quadratic objective function, and only adjust $t_i$ (e.g., by step halving) when no improvement is seen in the solution vector.

The Newton method converges quadratically to the solution vector but requires the first and second partial derivatives of the objective function and inversion of the Hessian matrix at each step. Because the second partial derivatives are often tedious to calculate, quasi-Newton methods approximate the inverse Hessian by $H_i$, an update of the last iterate's matrix; that is, $H_{i+1} = H_i + \delta_i$. Then


Several  commonly  used  updating  methods  are  the 
Davidon-Fletcher-Powell  (DFP)  and  Broyden-Fletcher- 
Goldfarb-Shanno  (BFGS)  methods  (McCormick  1983, 
chap.  9). 

Obvious  advantages  of  the  quasi-Newton  methods 
are  that  no  second  derivatives  or  matrix  inversion  are 
required.  By  combining  a  quasi-Newton  inverse  Hes¬ 
sian  update  algorithm  with  algorithms  to  approximate 
the  first  derivative  vector  (i.e.,  the  score  vector)  and  to 
calculate  the  optimal  step  size  at  each  iteration,  a  gen¬ 
eralized  optimization  program  may  be  developed  that 
requires  the  user  to  provide  only  the  objective  function. 
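
A generalized program of this kind can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the program used in this study: the names `num_grad`, `bfgs_minimize` and `neg_log_lik` and the small normal sample are hypothetical, the score vector is approximated by central differences, and step halving stands in for an exact optimal step size.

```python
import math

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def mat_vec(H, v):
    return [dot(row, v) for row in H]

def num_grad(g, x, h=1e-6):
    """Central-difference approximation to the gradient of g at x."""
    grad = []
    for j in range(len(x)):
        xp, xm = list(x), list(x)
        xp[j] += h
        xm[j] -= h
        grad.append((g(xp) - g(xm)) / (2.0 * h))
    return grad

def bfgs_minimize(g, x0, max_iter=200, tol=1e-8):
    """Minimize g from x0; return the estimate and the final inverse Hessian approximation."""
    n = len(x0)
    x = list(x0)
    H = [[1.0 if r == c else 0.0 for c in range(n)] for r in range(n)]  # start from the identity
    grad = num_grad(g, x)
    for _ in range(max_iter):
        s = [-v for v in mat_vec(H, grad)]            # search direction
        t, g0, slope = 1.0, g(x), dot(grad, s)
        # Step halving until a sufficient decrease is obtained.
        while t > 1e-12 and g([xi + t * si for xi, si in zip(x, s)]) > g0 + 1e-4 * t * slope:
            t *= 0.5
        x_new = [xi + t * si for xi, si in zip(x, s)]
        grad_new = num_grad(g, x_new)
        sigma = [a - b for a, b in zip(x_new, x)]      # change in the parameters
        y = [a - b for a, b in zip(grad_new, grad)]    # change in the gradient
        x, grad = x_new, grad_new
        if max(abs(v) for v in sigma) < tol:
            break
        sy = dot(sigma, y)
        if sy > 1e-12:                                 # update only when curvature is positive
            Hy = mat_vec(H, y)
            scale = 1.0 + dot(y, Hy) / sy
            for r in range(n):
                for c in range(n):
                    H[r][c] += (scale * sigma[r] * sigma[c]
                                - Hy[r] * sigma[c] - sigma[r] * Hy[c]) / sy
    return x, H

# Hypothetical data: a normal sample with parameters (mu, log sigma).
data = [1.2, 0.7, 2.3, 1.9, 0.4, 1.5]

def neg_log_lik(x):
    mu, log_sigma = x
    var = math.exp(2.0 * log_sigma)
    return len(data) * log_sigma + sum((z - mu) ** 2 for z in data) / (2.0 * var)

x_hat, H_hat = bfgs_minimize(neg_log_lik, [0.0, 0.0])
```

Only the objective function is supplied by the user. `H_hat` is the final inverse Hessian approximation; whether it is close enough to the true variance matrix is a separate question from whether `x_hat` has converged.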

Unfortunately,  these  inverse  Hessian  approximations 
can  be  poor  estimates  of  the  asymptotic  variance  ma¬ 
trix.  This  is  a  critical  defect  for  optimization  of  sta¬ 


tistical  problems,  since  the  variance  matrix  is  used  to 
test  the  significance  of  the  parameter  estimates  and  to 
calculate  confidence  limits  for  the  parameters. 

The order of the rate of convergence of an iterative procedure is defined to be the power $p$ such that

$$\lim_{i \to \infty} \frac{\|X_{i+1} - X^*\|}{\|X_i - X^*\|^{p}} = M, \qquad (6)$$

where $M$ is a constant. $\|C\|$ denotes a norm of any vector $C$; we will use the $L_2$ norm; i.e., $\|C\| = (\sum_j c_j^2)^{1/2}$. An algorithm is said to converge superlinearly if $p > 1$. It may be shown that Newton's method converges quadratically (i.e., $p = 2$). For secant methods in general, a class of algorithms including quasi-Newton methods, Tornheim (1964) showed that the asymptotic order of the rate of convergence is the solution to the equation $p^{J+1} - p^{J} - 1 = 0$, where $J$ is the number of parameters to be estimated. For example, this result shows that for a single parameter $p = (1 + \sqrt{5})/2 \approx 1.618$. The rate of convergence is slower with an increasing number of parameters, but in all cases is superlinear.
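
The asymptotic order for any number of parameters can be found numerically. The sketch below solves the equation $p^{J+1} - p^{J} - 1 = 0$ by bisection on the interval (1, 2); the function name `tornheim_order` is hypothetical, and bisection is simply a convenient choice of root finder.

```python
def tornheim_order(J, tol=1e-12):
    """Root p > 1 of p**(J+1) - p**J - 1 = 0, found by bisection."""
    def f(p):
        return p ** (J + 1) - p ** J - 1.0

    lo, hi = 1.0, 2.0  # f(1) = -1 < 0 and f(2) = 2**J - 1 > 0 for J >= 1
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if f(mid) > 0.0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

orders = {J: tornheim_order(J) for J in (1, 2, 3, 4, 5)}
```

For `J = 1` the root is the golden ratio quoted above, and the values in `orders` decrease toward 1 as `J` grows.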

In practice, we may estimate the order of the rate of convergence of the parameter estimates by taking logarithms of both sides of (6), and solving the following simple linear regression problem for $p$ over successive iterations ($i = 1, 2, \ldots, I$):

$$\ln\|X_{i+1} - X^*\| = a + p \ln\|X_i - X^*\|. \qquad (7)$$
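
The regression in (7) is an ordinary least-squares fit of each log error on the previous one. A minimal sketch, assuming the error norms $\|X_i - X^*\|$ are already available as a list; the function name `estimate_order` and the synthetic error sequence are hypothetical.

```python
import math

def estimate_order(errors):
    """Least-squares fit of ln(e[i+1]) = a + p * ln(e[i]); returns (a, p)."""
    xs = [math.log(e) for e in errors[:-1]]
    ys = [math.log(e) for e in errors[1:]]
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    p = sxy / sxx
    a = ybar - p * xbar
    return a, p

# Synthetic errors that follow e[i+1] = 0.8 * e[i]**1.5 exactly.
errors = [0.5]
for _ in range(5):
    errors.append(0.8 * errors[-1] ** 1.5)

a_hat, p_hat = estimate_order(errors)
```

On this synthetic sequence the slope recovers the order 1.5 and the intercept recovers ln 0.8. With real iterates the fit is only approximate, since (6) holds in the limit.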


Obviously, if we are considering using the quasi-Newton method at all, an analytic solution is unavailable. Thus, the asymptotic order of the rate of convergence only provides a guide to the rate of convergence expected in practice, and is not useful in determining whether convergence has been reached in any particular situation. How can we judge whether the $H$ matrix approximation adequately represents the true variance matrix?

Let $\sigma_i = X_{i+1} - X_i$, the difference in the parameter estimates from one iteration to the next, and $y_i = \nabla G(Z;X_{i+1}) - \nabla G(Z;X_i)$, the difference in the first derivatives of the objective function from one iteration to the next. Typical quasi-Newton methods update the $H$ matrix (i.e., $H_{i+1} = H_i + \delta_i$) so that $H_{i+1} y_i = \sigma_i$. It seems reasonable to expect that as $H_m \to H^*$, the correct inverse Hessian matrix, $H_m$ will solve several of the previous equations as well. That is,

$$H_m y_{m-2} = \sigma_{m-2}, \quad H_m y_{m-3} = \sigma_{m-3}, \quad \ldots, \quad H_m y_{m-J+1} = \sigma_{m-J+1},$$

where $J$ is (arbitrarily) the number of parameters to be estimated. If this is true, then we may use $\|H_m y_i - \sigma_i\| / \|\sigma_i\|$ to measure the adequacy of the $H$ approximation.
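
The proposed measure is cheap to compute once the pairs $(y_i, \sigma_i)$ from earlier iterations are kept. The sketch below assumes they are stored as plain lists; the function name `secant_residuals` and the two-parameter quadratic example are hypothetical.

```python
import math

def norm(v):
    return math.sqrt(sum(c * c for c in v))

def mat_vec(H, v):
    return [sum(h * c for h, c in zip(row, v)) for row in H]

def secant_residuals(H_m, ys, sigmas):
    """Return ||H_m y_i - sigma_i|| / ||sigma_i|| for each stored pair (y_i, sigma_i)."""
    out = []
    for y, s in zip(ys, sigmas):
        Hy = mat_vec(H_m, y)
        out.append(norm([a - b for a, b in zip(Hy, s)]) / norm(s))
    return out

# For a quadratic objective with Hessian A = diag(2, 4), y_i = A sigma_i exactly.
sigmas = [[1.0, 0.0], [0.5, -0.2]]
ys = [[2.0, 0.0], [1.0, -0.8]]

exact = secant_residuals([[0.5, 0.0], [0.0, 0.25]], ys, sigmas)   # true inverse Hessian
crude = secant_residuals([[1.0, 0.0], [0.0, 1.0]], ys, sigmas)    # identity as a poor guess
```

The true inverse Hessian reproduces every earlier step, so its residuals are zero, while the identity matrix leaves residuals of order one.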

4. Computational Methods

The BFGS quasi-Newton algorithm was implemented as a subroutine of a program that calculates analytic derivatives and includes an optimal step size routine. This program also calculated the norm measures for each iteration as described above. All programs were written in Fortran-77 and were run on an IBM 3090 system at the National Institutes of Health. All calculations were performed in double precision. Machine epsilon for this system is approximately $10^{-16}$.

The theoretical order of the rate of convergence was calculated by solving Tornheim's equation for the appropriate number of parameters (Borland International, Inc. 1987). Although these theoretical results apply to the convergence of the parameter vector, we used the same techniques to examine the convergence behavior of the inverse Hessian approximation ($H$). The observed order was calculated using the linear regression function of Lotus 1-2-3, version 2.01 (Lotus Development Corp. 1985). However, because a sharp improvement in $H$ is expected on the $J$th step when sufficient information is available to approximate the $J \times J$ matrix (McCormick 1983, p. 198), we omitted this step from the regression data.

5.  Results 

5.1.  Example  1:  Logistic  model 

Data for the first example are from a case-control study of laryngeal cancer among white male residents of the Texas Gulf Coast area (Brown 1988). A total of 209 cases and 250 controls (or their next of kin) were successfully interviewed to obtain information on their usual consumption of alcohol and tobacco, as well as information on other potential risk factors for this tumor. A prospective logistic model was used to estimate the relative risk for laryngeal cancer due to joint exposure to alcohol and tobacco, adjusting for age. The likelihood to be maximized is:


Results from a program using Newton's method with exact first and second derivatives (Harrell 1986) show that the maximum likelihood parameter estimates are $X^* = (-2.81, 0.81, 1.28, 0.36)'$. As shown in Figure 1, $X_i$ converged quickly to these values; after 6 iterations the norm of the relative error was 0.0001. $H$ converged to its correct values ($H^*$) less quickly; as expected, there was a dramatic improvement on the fourth step. After this improvement in $H$ the vector of first derivatives ($y$) dropped rapidly to 0.

Results of the linear regression procedure estimated the order of convergence of $X_i$ to be 1.20, compared to the predicted asymptotic order of 1.33. A similar calculation for $H_i$, excluding step 4, estimated an order of convergence of 1.06.

In practice, when the correct results are unavailable, a commonly used stopping rule (i.e., convergence criterion) is to require a "small" value of the maximum relative parameter change from one iteration to the next. That is, for step $i$

$$\max_j \left| \frac{x_{ij} - x_{i-1,j}}{x_{i-1,j}} \right| < \epsilon \qquad (9)$$

where $\epsilon$ is a small positive number and $j$ indexes the parameters in the vector $X$.
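
A stopping rule of this kind takes only a few lines of code. The sketch below divides by the previous iterate, which is an assumption about the exact form of (9), and it assumes no parameter estimate is exactly zero; the function names are hypothetical.

```python
def max_relative_change(x_prev, x_curr):
    """Largest |(current - previous) / previous| over the parameters."""
    return max(abs((c - p) / p) for p, c in zip(x_prev, x_curr))

def has_converged(x_prev, x_curr, eps=0.001):
    return max_relative_change(x_prev, x_curr) < eps

change = max_relative_change([2.0, -4.0], [2.002, -4.0])
```

The same function can be applied to the elements of successive $H$ matrices by flattening each matrix into a list.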

For  this  example,  as  shown  in  Figure  2,  we  might 
choose  to  stop  after  iteration  7,  where  the  maximum 
relative  parameter  change  was  less  than  0.001. 

It is less clear from Figure 2 when $H$ has improved sufficiently to declare convergence. In fact, this measure of convergence did not decrease monotonically over the 11 iterations shown. As described previously, we calculated $\|H_m y_i - \sigma_i\| / \|\sigma_i\|$ for each $i < m - 1$. Figure 3 shows the results for $m = 7, 8, 9$ and 10. Unlike the conventional convergence criteria described above, we do
see an improvement with each iteration; for this example $H_8$ and $H_9$ both satisfy the previous 4 equations to within a tolerance of 0.02.


Likelihood for the logistic model (Example 1):

$$L(Z;X) = \prod_{r} \frac{[\exp(x_0 + x_1 Z_{1r} + x_2 Z_{2r} + x_3 Z_{3r})]^{d_r}}{1 + \exp(x_0 + x_1 Z_{1r} + x_2 Z_{2r} + x_3 Z_{3r})} \qquad (8)$$

where

$d_r$ = 1 if person $r$ had laryngeal cancer, 0 if not
$Z_{1r}$ = packs smoked per day by person $r$
$Z_{2r}$ = 1 if person $r$ was a heavy alcohol drinker, 0 if not
$Z_{3r}$ = 1 if person $r$ was age 60+, 0 if not

$X = (x_0, x_1, x_2, x_3)'$ are the parameters to be estimated, where $x_j$ is the parameter corresponding to each $Z_j$, $j = 1, 2, 3$, and $x_0$ is a constant.

5.2. Example 2: Normal mixture model

Because the logistic example converged in so few iterations, we repeated the maximization on a more complex likelihood function. Data were simulated for a mixture of normal distributions, with 400 points generated from a N(0,1) distribution and 100 from a N(4,1) distribution (SAS Institute, Inc. 1985). The likelihood to be maximized was:

$$L(Z;X) = \prod_{r} \left[ \frac{p}{\sigma_1} \exp\left\{ \frac{-(Z_r - \mu_1)^2}{2\sigma_1^2} \right\} + \frac{(1 - p)}{\sigma_2} \exp\left\{ \frac{-(Z_r - \mu_2)^2}{2\sigma_2^2} \right\} \right] \qquad (10)$$

where $Z_r$ is the observed data vector for person $r$, $\mu_1$ and $\mu_2$ the means and $\sigma_1$ and $\sigma_2$ the standard deviations
for the two normal distributions, $p$ is the proportion of the total in distribution 1, and $X = (p, \mu_1, \sigma_1, \mu_2, \sigma_2)'$. We assume $\mu_1 < \mu_2$ so that the solution is unique.

Using a Newton method with exact derivatives we calculated the true maximum likelihood parameter estimates to be $X^* = (0.811, -0.0315, 1.02, 4.11, 0.900)'$. As shown in Figure 4, the patterns of convergence to the true solution follow those seen in the logistic example. After 10 iterations the norm of the relative error of the parameter vector was approximately 0.0001. Following a sharp improvement on step 5, $H$ improved steadily and $y$ converged to 0.

The order of the rate of convergence for the parameter vector was 1.17, compared to an asymptotic predicted order of 1.28 for 5 parameters. A similar calculation for $H$, again excluding the sharp drop on step 5, shows the order of convergence to be 0.98.

Use of the maximum relative parameter change to define convergence would lead to stopping after iteration 12 for $\epsilon = 0.001$ (Figure 5). As with the logistic model, this measure does not monotonically decrease for $H$. Application of each $H_i$ to previous iterations
showed  steady  improvement  with  each  iteration  (Figure 
6).  Hn  solves  the  previous  5  iteration  equations  within 
a  tolerance  of  0.02. 

6.  Conclusions 

For  two  representative  maximum  likelihood  estimation 
problems,  the  order  of  the  rate  of  convergence  of  the  pa¬ 
rameter  vector  to  the  correct  solution  was  nearly  equal 
to  its  predicted  asymptotic  value.  The  inverse  Hessian 
approximation,  the  estimated  asymptotic  variance  ma¬ 
trix  for  the  parameters,  showed  an  order  of  convergence 
slightly  less  than  that  of  the  corresponding  parameter 
vector,  but  converged  at  least  linearly  in  both  exam¬ 
ples.  The  proposed  norm  measure  of  the  closeness  of 
the  inverse  Hessian  matrix  approximation  to  its  correct 
values  appears  promising,  showing  an  improvement  at 
each  iteration  and  agreeing  with  our  decisions  to  declare 


convergence  when  the  true  matrix  values  are  known. 
Addition of a criterion such as this to the standard stopping rules should increase the number of iterations performed to ensure that the variance matrix as well as the
parameter  vector  is  reasonably  close  to  the  true  values 
or  identify  situations  where  the  values  are  suspect.  Al¬ 
though  the  examples  used  here  seem  sufficiently  difficult 
nonlinear  optimization  problems,  a  wider  range  of  like¬ 
lihoods  needs  to  be  examined.  We  encourage  continued 
work  in  this  area  so  that  statisticians  may  use  the  newly 
available  computer  methods  with  confidence. 

Figure 1. Convergence of Estimates of Parameters (X), First Derivatives (y), and Inverse Hessian Matrix (H) to True Values for Logistic Model Example. Normed relative difference = $\|X_i - X^*\| / \|X^*\|$ for parameter vector $X_i$ estimated at iteration $i$, when true parameter values = $X^*$; similar definitions for y and H.

Figure 2. Maximum Relative Change of Estimates of Parameters (X) and Inverse Hessian Matrix (H) for Logistic Model Example. Maximum relative change from iteration $i - 1$ to $i$ = $\max_j |(x_{ij} - x_{i-1,j}) / x_{i-1,j}|$ for X; similar for H.

Figure 3. Solution of the Equation $Hy = \sigma$ by the Inverse Hessian Approximations from Iterations 7, 8, 9, and 10 using $y$ and $\sigma$ from Previous Iterations, Logistic Model Example.

Figure 4. Convergence of Estimates of Parameters (X), First Derivatives (y), and Inverse Hessian Matrix (H) to True Values for Normal Model Example. Normed relative difference = $\|X_i - X^*\| / \|X^*\|$ for parameter vector $X_i$ estimated at iteration $i$, when true parameter values = $X^*$; similar definitions for y and H.

Figure 5. Maximum Relative Change of Estimates of Parameters (X) and Inverse Hessian Matrix (H) for Normal Model Example. Maximum relative change from iteration $i - 1$ to $i$ = $\max_j |(x_{ij} - x_{i-1,j}) / x_{i-1,j}|$ for X; similar for H.

Figure 6. Solution of the Equation $Hy = \sigma$ by the Inverse Hessian Approximations from Iterations 15, 17, and 19 using $y$ and $\sigma$ from Previous Iterations; Normal Model Example.

1. INTRODUCTION AND OVERVIEW

The ordered Dirichlet distribution has proved to be a meaningful prior distribution in a variety of applications including bioassay [Ramsey (1972)], life testing [Lochner (1975)], damage response [Mazzuchi and Singpurwalla (1982)], failure rate estimation [Mazzuchi and Singpurwalla (1983)], and accelerated life testing [Mazzuchi (198b)]. In the above situations the distribution is used as a prior distribution for a set of ordered probabilities. There are three reasons that the ordered Dirichlet distribution is so appealing:

i. It imposes no restriction on the probabilities other than the desired ordering.

ii. It allows for easy incorporation of prior information.

iii. It is mathematically tractable and allows for closed form posterior results.

While the first two claims are valid, the third is only partially true. It is possible to obtain results in closed form; however, evaluation of some posterior quantities may require large amounts of computer time and may be subject to error due to numerical manipulations. These errors are a function of the sample size and tend to increase as the sample size increases. With larger sample sizes we therefore offer as an alternative the use of the posterior approximation technique developed by Tierney and Kadane (1986) for obtaining posterior quantities.

In Section 2 we present an overview of the use of the ordered Dirichlet distribution with both prior and posterior results. In Section 3 we present an overview of the Tierney-Kadane method and show the ease with which this method can be used to obtain the posterior quantities of Section 2. In Section 4 we give some closing comments.

2. THE ORDERED DIRICHLET DISTRIBUTION

The ordered Dirichlet distribution defined for a set of variables $u = (u_1, u_2, \ldots, u_k)$ is given by

$$f(u \mid \beta, \alpha) = \frac{\Gamma(\beta)}{\prod_{j=1}^{k+1} \Gamma(\beta\alpha_j)} \prod_{j=1}^{k+1} (u_{j-1} - u_j)^{\beta\alpha_j - 1} \qquad (2.1)$$

with $\beta, \alpha_j > 0$, $j = 1, \ldots, k+1$, and $\sum_{j=1}^{k+1} \alpha_j = 1$. The joint distribution is defined over the simplex

$$\{ u : 1 = u_0 \ge u_1 \ge \cdots \ge u_k \ge u_{k+1} = 0 \}$$

thus preserving the desired ordering. It is easy to see that this distribution arises as a result of specifying a Dirichlet distribution on the successive forward differences of the above variables.

2.1 Prior Results

Prior information may be directly incorporated through the prior parameters by noting that

$$E[u_i] = \sum_{j=i+1}^{k+1} \alpha_j \qquad (2.2a)$$

$$\mathrm{Var}[u_i] = \frac{E[u_i] \, (1 - E[u_i])}{\beta + 1} \qquad (2.2b)$$

and thus if $u_1^*, \ldots, u_k^*$ are the prior best guess values for $u_1, \ldots, u_k$, defining $\alpha_j = u_{j-1}^* - u_j^*$, $j = 1, \ldots, k+1$ (with $u_0^* = 1$ and $u_{k+1}^* = 0$) we obtain a joint distribution whose marginal mean values are our prior best guesses. In addition, the parameter $\beta$ may be specified in such a way as to indicate the strength of conviction in those prior best guess values. This is true since once the $\alpha_j$ are specified, $\beta$ controls the magnitude of the variance.

2.2 Posterior Results

Without getting into specific problem scenarios, the general form for the likelihood in problems using the ordered Dirichlet distribution is given by

$$L(n, s; u) \propto \prod_{j=1}^{k} (1 - u_j)^{s_j} \, u_j^{n_j - s_j} \qquad (2.3)$$

where $n_j$ and $s_j$ are quantities used for estimating $u_j$ (essentially $n_j$ indicates the sample size for estimating $u_j$ and $s_j$ is often the number of failures recorded out of $n_j$). The posterior joint distribution for $u$ is thus proportional to the product of (2.1) and (2.3) and this is given by

j  1 


‘  ^“k.i  ' 


I  ,  ,  k 

I  *  ^“m  t  ,  bj) 

j  J  .1 _ _ 

I  PI  *4  2  am  ^  ,  bj) 

I  J  1 


PI  b(i/2  fZam  +  b^^,  ,  bj 

jJ: 

I  PI  b(  Za^lb^,,  .bj) 

I  J-C*l  ">  J 


We can expand the $(1 - u_j)$ terms in a binomial series yielding


”l  ^ 


z  z 

£,=0 


Uj  lUj  ,  Uj)  u^ 


This  can  be  expressed  as  a  weighted  combination  of 
densities  of  a  form  proportional  to 


for  c  -  d  and  ,1^2  0. 

The  posterior  joint  distribution  (2.4)  can  thus 
be  expressed  as 


n|“j' '  V  l“i 

k“  t 

P|b(  Z  (^ni  +  '’m“Smi^“ni+l^  ’ 

J.  1  "’-J 


n  “I  '1^,-"/^  - 

J  1 


The above density is similar to the generalized Dirichlet density studied by Lochner (1975) and Connor and Mosimann (1969). The constant of integration and thus the moments for this distribution may easily be obtained through repeated use of the integral identity (CRC TABLES Definite Integral Formula 609)


$$\int_a^b (x - a)^m (b - x)^n \, dx = (b - a)^{m+n+1} B(m+1, n+1)$$


W(£  ,s,n  ,oi,/3) 


fi  El 

J  1 


b(  Z  t^m  '  "m  Sm)  t  ^“m  d  ’  ^'^j) 

(2.10) 


''k 

w  -  E  Z 
«,.0  £,  =  0 


where $B(m+1, n+1) = \dfrac{\Gamma(m+1)\,\Gamma(n+1)}{\Gamma(m+n+2)}$ is the beta


function.  Thus  the  constant  of  integration  for  (2.6)  is 
obtained  as 


Once the weights $W(\ell, s, n, \alpha, \beta)$ and $W$ are obtained, posterior joint moments for (2.9) are obtained as a weighted combination of moments of the form (2.8),


PI  b(  E  »m  i  b^^,  ,  bj) 
J  1  "“J 


E  Z  Hluc"'  u/^ 

£,,0  £,^  0  ^ 


E|uc  '  Uj  '  I  (£  tn  sKli3  a)l 


I/,  Uy 

and  joint  moments  Efu^  Uj  lB,b|  for  (2.6)  arc 
obtained  as 


Though the expressions (2.9), (2.10), and (2.11) are closed form expressions, they may be difficult to evaluate. The time required to evaluate such expressions is a function of $s$ (and thus $n$) and $k$. In addition, when the $n_j$ are large, computer evaluation of the required beta function can lead to significant numerical error. This is due to the fact that, because of the summations involved, the required argument values often far exceed the maximum allowable values specified for accuracy. One alternative is to use the regeneration formula for the gamma function and factor, thus greatly simplifying expressions (2.9), (2.10), and (2.11). In so doing we may rewrite (2.10) and (2.8)


V(«,s,n,a,;3)  -  k 
J  1 


nl' 
$$\frac{\int U(u) \, e^{N\Lambda(u)} \, du_1 \cdots du_k}{\int e^{N\Lambda(u)} \, du_1 \cdots du_k} \qquad (3.1)$$

where

$N\Lambda(u) = L(u) + \pi(u)$

$L(u)$ = Log of the likelihood

$\pi(u)$ = Log of the prior joint distribution for $u$

$U(u)$ = function of $u$.

3.1  General  Results 

Based on Laplace's method, the approximation to (3.1) is given by
where $\Lambda^* = \theta(u)/N + \Lambda$ and $\theta(u) = \mathrm{Log}(U(u))$, $u^*$ ($\hat{u}$) is the joint mode of $\Lambda^*$ ($\Lambda$), $\Sigma^*$ ($\Sigma$) is minus the inverse Hessian of $\Lambda^*$ ($\Lambda$) evaluated at $u^*$ ($\hat{u}$), and $N = \sum_j n_j$. Note that due to the fact that the prior distribution is defined over the simplex


the  integration  is  also  subject  to  this  restriction  and 
therefore  so  is  selection  of  the  joint  modal  values. 

For obtaining the posterior marginal distribution of, say, $u_i$, which was not discussed in Section 2 due to its increased complexity, we note


respectively. A numerical problem may still exist for large $n_j$ in that the evaluation involves differences of
products  of  numbers  very  close  to  1.  While  further 
numerical  techniques  can  be  employed  we  suggest 
posterior  approximation  techniques  as  an  alternative. 
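
The overflow problem with large beta function arguments is easy to reproduce, and working on the log scale is one of the standard numerical remedies. The sketch below is an illustration only and is not taken from this paper; `log_beta` is a hypothetical helper built on the standard-library `math.lgamma`.

```python
import math

def log_beta(a, b):
    """ln B(a, b) = ln Gamma(a) + ln Gamma(b) - ln Gamma(a + b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

# Direct evaluation of Gamma(500) exceeds the double precision range.
try:
    math.gamma(500.0)
    direct_ok = True
except OverflowError:
    direct_ok = False

small = log_beta(3.0, 4.0)       # B(3, 4) = 2! 3! / 6! = 1/60
large = log_beta(500.0, 400.0)   # finite on the log scale
```

Ratios of beta functions can then be formed by subtracting logs before exponentiating. This avoids overflow, but not the cancellation in the alternating sums described above.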


p(  u  i  s  ,n  1 

3. THE TIERNEY-KADANE POSTERIOR APPROXIMATION TECHNIQUE

All the posterior quantities of Section 2 can be obtained using the Tierney-Kadane approximation technique for evaluating


I’cr  .«  j;ivcn  u^  let  be  the  mode  .T  A  for  fixed 

at  its  value  and  'call  this  funcMon  Note  that 

Li'Uj'  IS  an  m  !  by  1  vector  and  lot  denote  the 
minus  inverse  Jiussian  "f  evaluated  at  (this 

should  have  1  less  rank  then  that  of  1  hen  the 

posterior  m«rt;in-i)  -.J  is  I  r  i  hu  1 1<  n  -T  is  jiivcn  by 
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p(u-  I  Data) 


det( 

2tN  det(  2 


)]'exp{ 


Aj(  u(u,))“  A(u) 


(3.4) 


Thus  all  the  desired  posterior  quantities  of  Section  2 
may  be  obtained  and  in  addition  we  may  obtain  the 
posterior  marginal  distribution  for  each  Uj. 


3.2  The  Optimization  Problem 

Usually the major difficulty in using the Tierney-Kadane method is in solving two separate optimization problems. This is particularly true when dealing with a constrained problem as we have here. However, we can show that for this problem the derivative expressions are straightforward and in addition, with a simple transformation of the parameters we can convert our problem to an unconstrained problem.

The log of the posterior distribution, $N\Lambda(u)$, is given by

To convert to an unconstrained optimization program we select the following reparameterization

or inversely,

Maximization of the reparameterized problem is facilitated by replacing (3.6) with


rtic  additional  terms  *  which  arc  easily  obtained  as 
Under reasonable choices for the prior parameters, specifically for $\beta\alpha_j \ge 1$ for all $j$, the function $\Psi$ is guaranteed to be concave, and this appears to be true for any definition of the prior parameters provided the $n_j$ are large enough. Similar arguments can be used to show that for reasonable functions $U(u)$ the function $\Psi^* = N\{\Lambda^*\}$ is also concave. Thus the optimization is really no problem.

4.  CONCLUSIONS 

The Tierney-Kadane approximation technique appears to make reason iii of Section 1 considerably more true and allows for the application of the ordered Dirichlet distribution in large sample situations. In a future paper we will give the results and comparison of numerical calculation, numerical integration, and the Tierney-Kadane approximation technique, for obtaining posterior quantities for various combinations of $n_j$, $s_j$ and $k$.

COMPARISON OF “LOCAL MODEL” STATISTICAL CLASSIFICATION METHODS


Daniel Normolle, University of Michigan


Introduction 

The different methods of statistical classification may be divided into two groups: those which require assumptions concerning the class-conditional distribution functions (e.g., linear discrimination, logistic regression), and those which classify observations depending upon the class membership of the nearby observations, such as nearest neighbor classification and CART. This paper is concerned with a comparison of several of the latter, “local” methods, taken from a Monte Carlo study performed at the State University of New York at Binghamton and the University of Michigan which examines various aspects of 22 statistical classification methods over 12,000 data sets. The estimated rates of correct classification for the various methods, analysis of the differences in performance of the methods, and characteristics of the optimization techniques used in the various methods are presented. In particular, the use of cross-validation to select the neighborhood size in the nearest neighbor method is discussed.

Methods 

The  local  classification  methods  have  in 
common  the  rule  that,  in  general,  observations  are 
assigned  to  the  same  class  as  their  neighbors. 

Each  local  classification  method  has  a  different 
definition  of  what  a  neighborhood  is,  and  these 
different  definitions  constitute  the  differences 
between  the  methods.  Each  of  the  methods 
requires  that  the  neighborhood  size  be  optimized 
by  some  method  to  prevent  neighborhoods  which 
are  too  small  and  hence  overfit  the  data,  or  which 
are  too  large  and  miss  important  features  of  the 
measurement  space.  For  each  of  the  following 
methods,  the  neighborhood  process  will  be 
described,  along  with  the  method  to  optimize  the 
neighborhood  size. 

The dimension of the measurement space is represented by $p$, the number of classes by $d$, and the number of observations in the training sample by $n = \sum_{i=1}^{d} n_i$. A $p$-dimensional observation is denoted by $x$, with $j$th component $x^j$. The proportion of the population in the $i$th class is $\pi_i$, and the class-conditional density function is written $f_i(\cdot)$. The sample mean of the $i$th class is written $\bar{x}_i$, and the common sample covariance matrix is $S$.


The classification tree method (CTREE), a refinement of an older technique called classification by statistically equivalent blocks, is described in detail in Breiman et al. (1984). The measurement space is recursively partitioned into rectangles by cuts perpendicular to the measurement axes, and a classification rule is assigned to each rectangle based on the class membership of the observations within the rectangle. The recursion is halted when the rectangles contain observations from only one class, or contain only one observation. The structure of the classification rule is represented by a binary tree, where the cuts are placed at the non-terminal nodes, and each rectangle is assigned to a terminal node.

The  recursive  rule  as  described  tends  to 
construct  trees  which  overfit  the  data,  resulting  in 
over-optimistic  estimates  of  classification  rates  on 
training  data,  and  poor  performance  on  subsequent 
data  sets.  The  tree  size  is  optimized  by  growing 
ten  auxiliary  trees,  each  based  on  nine-tenths  of 
the original training set, and then using the hold-out
sample consisting of the remaining data points
to  estimate  the  true  correct  classification  rate. 
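
The nine-tenths / one-tenth division described above can be sketched as follows. This is only an illustration: the function name `ten_fold_splits` is hypothetical, and assigning observations to folds by index modulo ten is an assumption, since the text does not say how the ten hold-out samples were drawn.

```python
def ten_fold_splits(n, folds=10):
    """Yield (train, holdout) index lists for n training observations.

    Each holdout holds roughly one-tenth of the indices; the matching
    train list holds the remaining nine-tenths. The interleaved
    assignment (index modulo folds) is an illustrative assumption.
    """
    for f in range(folds):
        holdout = [i for i in range(n) if i % folds == f]
        train = [i for i in range(n) if i % folds != f]
        yield train, holdout
```

An auxiliary tree would be grown on each `train` list and scored on the matching `holdout` list.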

The  trees  are  used  to  determine  the  value  of  a 
cost-complexity  parameter  which  penalizes  both 
overcomplicated  trees  and  high  misclassification 
rates.  The  value  of  this  parameter  is  used  to 
“trim”  the  main  tree,  by  combining  the  rectangles 
with small numbers of observations to achieve a
tree  which  represents  the  data,  but  does  not 
overfit  it. 

The kernel PDF estimation method of
statistical classification (KERNEL) directly
estimates $f_i(x)$ using only weak assumptions about
the functional form of the $f_i(\cdot)$, and then assigns
class membership depending upon the values of
the estimated density functions. The application
of density estimation to statistical classification is
described in Hand (1981). The density estimator
at $x = (x^1, \ldots, x^p)'$ is:

$$\hat{f}_i(x) = \frac{1}{n_i} \sum_{k=1}^{n_i} \prod_{j=1}^{p} \frac{1}{h_{ij}} K\!\left( \frac{x^j - x^j_{ik}}{h_{ij}} \right),$$


where $K(\cdot)$ is a symmetric, univariate probability
density function and $x^j_{ik}$ is the $j$th element of the
$k$th member of the $i$th class in the training sample.
The use of a density function for K ensures that
nearby points will contribute more to the density
estimate than distant points. The kernel function
of Epanechnikov (1969),

$$K(u) = \max\left\{ \frac{3}{4\sqrt{5}} - \frac{3u^2}{20\sqrt{5}},\; 0 \right\},$$

which asymptotically minimizes the mean
integrated square error for a large class of
univariate density functions (Tapia and Thompson,
1978), is used; it has the additional advantage of a
bounded support, which reduces computational
cost.
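
A minimal sketch of such a density estimate is given here, assuming the estimator is a product of univariate Epanechnikov kernels with one smoothing parameter per coordinate. The function names and the list-based data layout are illustrative choices, not taken from the paper's FORTRAN code.

```python
import math

SQRT5 = math.sqrt(5.0)


def epanechnikov(u):
    """Unit-variance Epanechnikov kernel, zero outside |u| <= sqrt(5)."""
    return max(3.0 / (4.0 * SQRT5) - 3.0 * u * u / (20.0 * SQRT5), 0.0)


def product_kernel_density(x, sample, h):
    """Estimate one class density at the point x.

    x      -- sequence of p coordinates
    sample -- list of training points of that class, each of length p
    h      -- sequence of p smoothing parameters, one per coordinate
    """
    total = 0.0
    for point in sample:
        term = 1.0
        for xj, sj, hj in zip(x, point, h):
            term *= epanechnikov((xj - sj) / hj) / hj
        total += term
    return total / len(sample)
```

Because the kernel vanishes beyond a distance of sqrt(5) smoothing parameters, training points far from x contribute exactly zero, which is the computational saving mentioned above.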

The smoothing parameters $h_{ij}$ are calculated
independently for each class i and coordinate j by
an iterative method. The process is initialized by
the range of the training sample,

$$h^0_{ij} = \max\{x^j_{i1}, \ldots, x^j_{in_i}\} - \min\{x^j_{i1}, \ldots, x^j_{in_i}\},$$

and is updated according to

$$h^{s+1}_{ij} = n_i^{-1/5}\, \alpha(K)\, \delta(h^s_{ij}),$$

where

$$\alpha(K) = \left( \frac{12\sqrt{5}}{100} \right)^{1/5}$$

and, if $\hat{f}^s_{ij}(\cdot)$ is the estimate of $f_{ij}(\cdot)$ using the
training sample and $h^s_{ij}$,

$$\delta(h^s_{ij}) = \left( \int \left[ \hat{f}^{s\,\prime\prime}_{ij}(y) \right]^2 dy \right)^{-1/5}.$$

The resulting density estimates for each class are
then calculated at a test point x, and the test
point is assigned to the class having the largest
estimate.

The k-nearest neighbor method (KNN, Fix and
Hodges, 1951) is supplied by the analyst with an
integer k, and then classifies a point x according to
the class memberships of the k observations from
the training sample which are closest in the
measurement space to x. Distance is measured by
the Mahalanobis distance,

$$d(x, v) = (x - v)' S^{-1} (x - v).$$

The choice of the size of the neighborhood,
determined by k, is as problematic with the
k-nearest neighbor method as with those previously
mentioned. Here, cross-validation is used to
determine k, producing a cross-validated k-nearest
neighbor (XKNN). Each observation $x_{im}$ in the
training set is classified using the other points in
the training set as follows. The Mahalanobis
distance to every other point $x_{jl}$ in the training set
is first calculated:

$$d_{jl} = (x_{im} - x_{jl})' (S_{im})^{-1} (x_{im} - x_{jl}), \qquad (j = 1, \ldots, d \text{ and } l = 1, \ldots, n_j),$$

where $(S_{im})^{-1}$ is the inverse of the common
covariance matrix with the $im$th observation
removed. This vector of n - 1 distances is sorted
along with the class memberships of the training
observations. Then, starting with k = 1, a running
tally is kept of the class memberships. For each k,
the k class memberships vote, resulting in a class
estimate for $x_{im}$ for each value of $1 \le k \le n - 1$. This
procedure is repeated for all the points in the
training sample, and the number of points
correctly classified for each value of k is
accumulated. $k^*$ is then selected as that particular
value of k which maximizes the number of
correctly classified training points (actually, a
three-point moving average is maximized), yielding
an optimized neighborhood size for the
nearest-neighbor method. Each observation in the testing
sample is then classified using the $k^*$ nearest
neighbors in the training sample.

The last method is not included as a "local
model" method, but because it represents an
interesting bridge to the global methods in the
original study. Each observation in the training
sample is replaced, coordinate-by-coordinate, by its
normal score. The normal score of the $i$th largest
value in a set of numbers $\{x_1, \ldots, x_n\}$ equals
[...], where $\Phi$ is the standard Gaussian
cumulative distribution function. The testing
observations are ordered independently of the
training observations. The conditional
discriminant function performs a linear or
quadratic discriminant analysis conditional on a
test of the hypothesis

$$H_0: \Sigma_1 = \Sigma_2,$$

where $\Sigma_i$ is the dispersion matrix of the $i$th class.
$S_i$ is the within-class sample covariance matrix,
calculated from the transformed data, and $|S|$ is
the determinant of S. The test used in the
simulations is a special case (for $n_1 = n_2$ and d = 2)
of Box's (1949) modification to Wilks' $\Lambda$: calculate

$$\rho = 1 - \frac{2p^2 + 3p - 1}{2(n - 2)(p + 1)},$$

and reject the hypothesis of equality of dispersion
matrices if $-2\rho \log(\Lambda)$ exceeds the 90th
percentile of a $\chi^2$ distribution with p(p + 1)/2
degrees of freedom. The symbol NCDF is used to
represent the use of the conditional discriminant
function on the normal scores of the data.
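
The cross-validated choice of k used by XKNN can be sketched as follows. This is a simplified illustration, not the paper's implementation: squared Euclidean distance stands in for the Mahalanobis distance with the leave-one-out covariance matrix, ties in the vote go to the smallest class label, and the moving average at the two ends uses only the available neighbours. All function names are hypothetical.

```python
from collections import Counter


def loo_correct_counts(points, labels):
    """For k = 1..n-1, count training points whose k nearest other
    training points vote for the correct class (leave-one-out)."""
    n = len(points)
    correct = [0] * (n - 1)  # correct[k - 1] belongs to neighbourhood size k
    for m in range(n):
        # Squared Euclidean distance: an assumption replacing Mahalanobis.
        others = sorted(
            (sum((a - b) ** 2 for a, b in zip(points[m], points[l])), labels[l])
            for l in range(n) if l != m
        )
        tally = Counter()
        for k, (_, lab) in enumerate(others, start=1):
            tally[lab] += 1  # running tally of class memberships
            # Majority vote; ties are broken in favour of the smallest label.
            winner = max(sorted(tally), key=lambda c: tally[c])
            if winner == labels[m]:
                correct[k - 1] += 1
    return correct


def select_k(correct):
    """Return the k maximizing a three-point moving average of the counts."""
    best_k, best_avg = 1, -1.0
    for i in range(len(correct)):
        window = correct[max(0, i - 1): i + 2]
        avg = sum(window) / len(window)
        if avg > best_avg:
            best_k, best_avg = i + 1, avg
    return best_k
```

Sorting the distances once per left-out point and updating a running tally gives the class estimate for every k in a single pass, which is the economy the procedure relies on.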

Since Bayes rule maximizes the expected
probability of correct classification (Glick, 1971), it
shall be used as a benchmark against which the
other methods will be compared. The version of
Bayes rule used here, which is less than general
but sufficient for the circumstances of the
experiment, where d = 2 and $\pi_1 = \pi_2 = \tfrac{1}{2}$, is:

Classify $x \in i$ if $f_i(x) = \max\{f_1(x), f_2(x)\}$,

where $f_1(\cdot)$, $f_2(\cdot)$ are known class-conditional
probability density functions. Bayes rate, the
proportion  of  observations  correctly  classified  using 
Bayes  rule,  will  be  estimated  by  calculating  the 
class-conditional  density  functions  at  each  testing 
sample  point,  which  is  possible  since  these  density 
functions  are  known  precisely  in  a  simulation 
experiment.  The  correct  classification  rates  of  the 
individual  methods  will  be  reported  as  the 
percentage  of  the  Bayes  rate. 
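
The benchmark can be sketched as follows for a univariate case. The Gaussian densities, the sample values in the usage note, and all function names are assumptions made for illustration; the experiment itself uses the known densities of each design cell.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a univariate Gaussian (an illustrative known density)."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def bayes_classify(x, f1, f2):
    """Equal-prior Bayes rule for two classes: pick the larger density."""
    return 1 if f1(x) >= f2(x) else 2


def correct_rate(classify, test_points, test_labels):
    """Proportion of test points assigned to their true class."""
    hits = sum(1 for x, y in zip(test_points, test_labels) if classify(x) == y)
    return hits / len(test_points)


def percent_of_bayes_rate(method_rate, bayes_rate):
    """Express a method's correct classification rate relative to Bayes rate."""
    return 100.0 * method_rate / bayes_rate
```

For example, `correct_rate(lambda x: bayes_classify(x, f1, f2), points, labels)` estimates the Bayes rate on a test sample. Because that estimate comes from a finite test sample, a method's percentage of Bayes rate is a ratio of two sample rates and is not mathematically capped at 100.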

The  full  details  of  the  Monte  Carlo  Experiment 
are presented in Normolle (1987). The simulations
are written in FORTRAN, compiled with the IBM H
optimizing  compiler,  and  executed  on  an  IBM  4381 
at  the  State  University  of  New  York  at 
Binghamton,  and  on  the  IBM  3090-400  at  the 
University  of  Michigan. 

The experiment is a fully-crossed 5 x 2 x 2 x 3 x 2
design with five factors: distribution
type (Gaussian, Cauchy, lognormal, bimodal,
uniform); dimension (2, 6); training sample size
(40, 80, 160); between-class separation (low, high);
within-class dispersion (equal, unequal). A
multiplicative congruential generator with
multiplier $7^5$ and base $2^{31} - 1$ generates the
primary [0,1] random variates, which are then
transformed to specific standardized distributions
by well-known methods (e.g., Gaussian variates
are obtained by the Box-Muller transformation)
described in the thesis. Multiplication by rotation
matrices and translation by location vectors
determine specified population location and
dispersion, chosen to achieve (within 1%) a
predetermined Bayes' rate in the population.
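
The variate generation can be sketched as follows. The multiplier 7^5 = 16807 and modulus 2^31 - 1 are an assumption (the exponents are damaged in the source; these are the values of the widely used "minimal standard" generator), and the generator-based interface is an illustrative choice rather than the structure of the original FORTRAN code.

```python
import math

MULTIPLIER = 7 ** 5     # 16807; assumed, exponent illegible in the source
MODULUS = 2 ** 31 - 1   # 2147483647; assumed, exponent illegible in the source


def lcg(seed):
    """Yield uniform variates in (0, 1) from a multiplicative
    congruential generator; seed must lie in 1..MODULUS-1."""
    state = seed
    while True:
        state = (MULTIPLIER * state) % MODULUS
        yield state / MODULUS


def box_muller(uniforms):
    """Turn successive pairs of uniforms into pairs of independent
    standard Gaussian variates."""
    while True:
        u1, u2 = next(uniforms), next(uniforms)
        r = math.sqrt(-2.0 * math.log(u1))
        yield r * math.cos(2.0 * math.pi * u2)
        yield r * math.sin(2.0 * math.pi * u2)
```

A stream of Gaussian variates is then obtained with `box_muller(lcg(seed))`; rotation and translation of vectors built from that stream would set the population dispersion and location.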

Each  design  point  was  replicated  100  times,  for  a 
total  of  12,000  training  sets.  All  classification 
methods  are  calibrated  on  every  training  set,  and 
evaluated  by  their  correct  classification  rate  on  a 
test  set  of  1000  observations  associated  with  the 
design  cell. 

Results 

The  result  of  optimization  on  the  classification 
tree  and  nearest  neighbor  methods  is  tested  by 
comparing  the  optimized  to  the  non-optimized 
version.  The  non-optimized  classification  tree 
(TREE) is the tree grown on the training sample
without  cross-validation  pruning.  The  XKNN  is 
compared  to  the  KNN  rule,  where  k  is  the 
smallest  odd  integer  greater  than  the  square  root 
of  the  training  sample  size. 
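
The square-root rule for the non-optimized KNN can be written out directly. The function name is hypothetical; the rule itself is as stated above.

```python
import math


def sqrt_rule_k(n):
    """Smallest odd integer strictly greater than the square root of n."""
    k = math.isqrt(n) + 1  # smallest integer strictly greater than sqrt(n)
    return k if k % 2 == 1 else k + 1
```

For the three training sample sizes of the experiment this gives k = 7 (n = 40), k = 9 (n = 80) and k = 13 (n = 160).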

It  is  seen  from  Table  1  that  pruning  has  a 
larger  effect  on  the  higher-dimensional  data,  and 
that the effect increases with the training sample
size on both the 2- and 6-dimensional sets.
Analysis of variance on the difference between
TREE and OTREE (not displayed) produces
significant  main  effects  for  all  the  experimental 
variables. 

Table 1.
Mean Percent of Bayes Rate

              % of Bayes Rate      Paired
  p     n     TREE     OTREE       Comp. t

  2    40     86.6     86.5         -2.0
       80     89.7     90.1          5.4
      160     92.4     93.6         16.6
  6    40     82.6     83.1          8.0**
       80     87.6     89.4         18.8**
      160     91.9     93.8         21.9**

** p<0.0001

Cross-validation selects neighborhoods which
are larger than the square root rule (Table 2) and
seem to be only slightly affected by the dimension
of the data. The neighborhoods are smaller on the
heavier-tailed distributions (Cauchy and
Lognormal, Table 3), and larger on the
shorter-tailed (Uniform and Bimodal). An analysis of
variance (not shown) demonstrates that all the
design variables except the equality of the
within-class dispersion matrices significantly affect the
difference in the classification rates between the
KNN and XKNN rules. As seen in Table 4,
cross-validation degrades the performance of the
nearest-neighbor method on the sparsest data (n = 40 and
80, p = 6), but as the concentration of data
increases, the improvement in classification
increases substantially. The increase is
pronounced in situations where the classes are not
well-separated (Table 5), while the cross-validated
and square root values of k produce essentially the
same results when the classes are already
well-separated.

Table 2.
Mean Values of Cross-Validated k

  p     n     Mean k

  2    40      9.28
       80     17.18
      160     30.32
  6    40      9.27
       80     17.86
      160     31.51

Table 3.
Mean Values of k

  Distribution    Mean k

  Normal          21.60
  Cauchy          12.25
  Lognormal       14.02
  Bimodal         24.05
  Uniform         24.25

Table 4.
Mean Percent of Bayes Rate

              % of Bayes Rate      Paired
  p     n     KNN      XKNN        Comp. t

  2    40     91.8     92.8         10.7
       80     93.4     94.8         16.8
      160     94.6     96.5         21.8
  6    40     90.2     88.2        -14.4
       80     91.8     91.5         -3.1
      160     91.9     93.8         13.8

*  p<0.01
** p<0.0001

The iterative process to calculate smoothing
parameters for KERNEL tends to produce values
which are smaller than optimal (the optimal
values can be determined exactly for some of the
known class-conditional densities). A
cross-validation method of smoothing parameter
estimation is currently being implemented to
remedy this situation.

Tables 6 and 7 present the mean and
minimum percentage of Bayes rate obtained by
the four methods considered, ordered by dimension
and sample size. Each mean and minimum
displayed is based on 2000 observations. For the
2-dimensional data, XKNN displays the highest
average standardized classification rate of all 22
methods. As the number of measurement
variables increases, the efficacy of NCDF, as
measured by the mean, actually increases, while
the KERNEL and XKNN methods are degraded.
At the lowest sample size, OTREE's performance
is inferior to the other methods, but it improves at
n = 80 and n = 160. In addition, OTREE's response
rate at n = 160 is relatively unchanged between
p = 2 and p = 6, unlike the other local methods,
which seem to degrade quickly as the number of
measurement variables increases.


Table 5.
Mean Percent of Bayes Rate

                % of Bayes Rate      Paired
  Separation    KNN      XKNN        Comp. t

  Low           89.3     90.7         19.4
  Hi            95.3     95.2          1.3

** p<0.0001



Table 6.
Mean Percent of Bayes Rate

                       % of Bayes Rate
  p     n     OTREE    KERNEL    XKNN     NCDF

  2    40     86.5     92.1      92.8†    89.4
       80     90.1     93.7      94.8†    91.8
      160     93.6     95.1      96.5†    91.8
  6    40     83.1     87.1      88.2     91.1†
       80     89.4     88.8      91.5     93.4†
      160     93.8     90.7      93.8     94.8

† Best of All Methods


Table 7.
Minimum Percent of Bayes Rate

                     Minimum % of Bayes Rate
  p     n     OTREE    KERNEL    XKNN     NCDF

  2    40     49.9     70.4      71.0†    40.6
       80     63.7     70.3      73.0†    43.9
      160     74.6     78.0      79.6†    35.9
  6    40     34.6     52.8      17.4‡    42.2
       80     50.6     56.4      51.2     41.6
      160     69.2†    58.6      55.2     44.2

† Best of All Methods
‡ Worst of All Methods


XKNN  performs  well  compared  to  all  other 
methods  with  respect  to  the  minimum  correct 
classification  rate  at  p  =  2,  and  competes 
reasonably  at  p  =  6,  except  for  a  disaster  which 
occurred  at  n  =  40.  KERNEL  is  notable  in  that 
the  minimum  rate  does  not  decrease  as  far  at 
small  sample  sizes  at  p  =  2  and  p  =  6  as  NCDF  and 
OTREE.  KERNEL  is  the  minimax  method  over 
all  12,000  observations. 

Table  8  displays  the  mean  percentage  of  Bayes 
rate  obtained  by  distribution  type  of  the  sample 
data.  Each  mean  is  based  on  2400  observations. 

NCDF, which works remarkably well with
elliptical  distributions  (Normal  and  Cauchy),  even 
if  they  are  heavy-tailed,  breaks  down  substantially 
when  presented  with  data  from  the  very  skewed 
Lognormal  distribution. 

Analyses of variance (not shown) produce very
significant main effects (p < 0.0001) on the
classification  rates  of  the  four  methods  studied. 

The  design  variables  account  for  42%  to  52%  of 
the  variance  of  the  local  model  methods,  and  72% 
of  NCDF. 


Table 8.
Mean Percent of Bayes Rate

  Dist.                % of Bayes Rate
  Type         OTREE    KERNEL    XKNN     NCDF

  Normal       90.8     93.6      97.1     100.0
  Cauchy       98.3     96.6      97.3     103.3
  Lognormal    82.8     81.0      82.3      63.4
  Bimodal      92.2     96.5      97.2      99.5
  Uniform      82.9     88.6      90.8      93.9†

† Best of All Methods




Discussion  and  Conclusions 

Generalizations  from  a  Monte  Carlo 
experiment  are,  of  course,  problematic,  so  we 
proffer  the  following  conclusions  and 
recommendations  with  the  usual  caveats. 

While,  at  the  sample  sizes  considered,  cross- 
validation  is  of  some  benefit,  the  computational 
cost  is  high  and  the  value  in  classification  power 
limited. Thus, cross-validation requires training
samples at least as large as the biggest considered
here to be effective even on very low-dimensional
(e.g., p = 2 or 3) data.

Since  it  is  based  on  the  marginal  empirical 
distribution functions, OTREE performs better on
the  higher-dimensional  data  than  does  either  the 
XKNN  or  KERNEL.  However,  the  low  mean  and 
minimum  rates  at  n  =  40  and  n  =  80  suggest  that 
OTREE is not appropriate at these small sample
sizes,  but  that  once  an  adequate  sample  size  (say, 
150  training  observations)  is  obtained,  more 
variables  may  be  included  in  the  analysis  than 
with  competing  local  model  methods. 

KERNEL  and  XKNN  offer  higher  average 
performance  at  the  lower  dimension.  KERNEL  is 
the  minimax  classifier  over  the  entire  experiment. 

NCDF shows promise on the sparse, higher-dimensional
data when the sample size is too small
for  the  effective  performance  of  the  classification 
tree,  but  is  subject  to  degraded  performance  when 
the  data  are  very  skewed. 

As a group, the local model methods are strong
at p = 2; if the analyst is unable to make any
assumptions about the data, sample sizes like
those considered in this study will yield good
results,  especially  with  the  cross-validated  nearest 
neighbor.  The  classification  tree  method  requires 
larger  sample  sizes  than  the  other  methods  even 
with  two  dimensions,  but  can  tolerate  more 
variables  once  this  barrier  is  overcome.  The  cost 
of  all  of  these  “assumption-free"  methods  increases 
rapidly  with  the  number  of  dimensions,  so  that 
either  some  dimension  reduction  technique  must 
be  applied  to  the  data  before  a  local-model  method 
is  applied,  the  sample  size  must  be  quite  large,  or 
a  rank-based  or  robust  global  alternative  must  be 
employed. 
An Example of the Use of A Bayesian Interpretation of MDA Results
James  R.  Nolan,  Siena  College 


This  article  is  concerned  with  the 
interpretation  of  multiple  discriminant  analysis 
results.  Specifically,  a  demonstration  will  be 
made  of  the  usefulness  of  Bayesian  methods  for 
enhancing  the  utility  of  the  multiple 
discriminant  results. 

The  primary  objective  of  discriminant  analysis 
is  to  classify  cases  into  two  or  more  groups.  An 
implicit  assumption  for  using  this  technique  is 
that  the  groups  can  be  differentiated  based  on  a 
combination  of  multivariate  normal  variables.  In 
addition, if the variances and covariances of the
independent variables are equal, or nearly so, a
linear classification model is optimal.

Thus,

$D_j = b_1 x_{1j} + b_2 x_{2j} + \cdots + b_n x_{nj}$

where j = group number
      $b_i$ = coefficient

The general procedure for conducting an
analysis using the discriminant method is as
follows:

(1)  a  priori  definition  of  a  sample  of  cases  in 
each  group. 

(2)  definition  of  the  variables  which  are  thought 
to  account  for  intergroup  differences. 

(3)  submission  of  the  data  to  an  MDA  (multiple 
discriminant  analysis)  algorithm. 

(4)  determination  of  the  "cutting  point"  or 
critical  discriminant  score  which  will 
separate  the  groups. 

(5) ultimately, you want the probability of
specific  group  membership. 

Once  the  procedure  is  complete,  there  are 
various  methods  used  to  determine  the  value  of 
the  resulting  equation  -  value  as  far  as 
statistical  significance  is  concerned  as  well  as 
its  inference  or  predictive  ability.  Some  of 
these  procedures  are: 

eigenvalue = between-group SS / within-group SS

The  larger  the  eigenvalue,  the  better  the 

equation  is  able  to  differentiate. 

canonical  correlation  =  the  degree  of  association 
between  the  discriminant  scores  and  the  groups. 

The  higher  this  value,  the  better. 

confusion  matrix  =  percentage  of  cases  correctly 
and incorrectly classified.

The  major  problem  with  this  last  method  is  that 
you  are  using  the  very  same  cases  used  for 
constructing  the  equation  to  determine  the  value 
of  the  equation  -  a  very  biased  view  results.  One 
way  to  get  around  this  is  to  divide  the  data  into 
two  groups  at  the  beginning  of  the  analysis;  then 
use  one  group  to  construct  the  equation  and  the 


other  group  to  determine  its  value.  The  problem 
here  is  that  you  lose  one  half  of  your  original 
data  when  constructing  the  equation.  A  better 
procedure is to employ the "jackknife" method
(alternately  referred  to  as  the  "leaving  one  out" 
method)  whereby  you  construct  the  equation  using 
all  but  one  of  your  cases  and  then  proceed  to 
classify  that  "left  out"  case.  After  doing  this 
for  all  cases,  you  have  a  much  less  biased  view 
of  the  value  of  the  discriminant  equation. 
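
The "leaving one out" procedure can be sketched as follows. The classifier used here, which assigns a score to the group with the nearer mean, is only a hypothetical stand-in for refitting the discriminant equation; the function names and the two-group restriction are assumptions for illustration.

```python
def leave_one_out_accuracy(cases, groups, fit, classify):
    """Refit on all cases but one, classify the left-out case, and
    return the proportion of left-out cases classified correctly."""
    hits = 0
    for i in range(len(cases)):
        train_cases = cases[:i] + cases[i + 1:]
        train_groups = groups[:i] + groups[i + 1:]
        model = fit(train_cases, train_groups)
        if classify(model, cases[i]) == groups[i]:
            hits += 1
    return hits / len(cases)


def fit_group_means(scores, groups):
    """Stand-in 'fit': the mean score of group 1 and of group 2."""
    g1 = [s for s, g in zip(scores, groups) if g == 1]
    g2 = [s for s, g in zip(scores, groups) if g == 2]
    return sum(g1) / len(g1), sum(g2) / len(g2)


def classify_nearest_mean(model, score):
    """Assign the score to the group whose mean is nearer (midpoint cut)."""
    m1, m2 = model
    return 1 if abs(score - m1) <= abs(score - m2) else 2
```

Each case is classified by an equation that never saw it, so the resulting proportion is far less optimistic than the confusion matrix computed on the construction sample.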

The  major  problem  with  stopping  the  analysis 
at this point is that two important items are
being  ignored: 

(1)  prior  probabilities  of  group  membership. 

(2)  incorporating  additional  information  about 
cases. 

These  two  items  can  be  included  in  the 
analysis  if  we  utilize  Bayes'  Rule.  If  the 
"cutting  point"  or  critical  discriminant  score  is 
placed  midway  between  the  mean  discriminant  D 
scores for each group (in the two group case),
the implied objective is to equalize the
probabilities of misclassifying the cases.

In  many  situations,  we  know  that  there  is  a 
higher  probability  a  case  belongs  to  one  group 
versus  the  other.  If  your  objective  is  to 
minimize misclassification, period, then the
cutting  point  should  be  moved  toward  the  mean  D 
score  of  the  smaller  group. 

How  far  should  the  cutting  point  be  moved? 
One  could  alternately  try  many  different  cutting 
points  to  find  the  best.  Needless  to  say,  this 
would  be  very  time  consuming.  Bayes'  rule  will 
come  in  handy  here,  but  we  still  need  some  more 
information;  suffice  it  to  say  that  we  should  be 
aware  of  prior  probabilities  P( group  1). 

We  are  still  not  at  the  point  where  we  can 
determine  the  optimal  solution  (equation). 
Consideration  of  additional  information  available 
for  each  case  will  help  us.  To  take  advantage  of 
the  additional  information  available,  we  need  to 
assess  the  likelihood  of  the  additional 
information  under  different  circumstances. 

For  example,  if  the  discriminant  function 
scores  are  normally  distributed  for  each  of  two 
groups,  and  the  parameters  of  the  distribution 
can  be  estimated,  it  is  possible  to  calculate  the 
probability  of  obtaining  a  different  discriminant 
function  value  if  the  case  is  a  member  of  group 
one  or  group  two. 

This  probability  is  called  the  conditional 
probability  of  the  discriminant  score  (D),  given 
the  group,  P(D/G(i)).  To  calculate  the 
probability,  the  case  is  assumed  to  belong  to  a 
particular  group,  and  the  probability  of  an 
observed  score  given  membership  in  the  group  is 
estimated  using  the  normal  distribution. 

Finally,  this  information  about  group 
membership  and  the  conditional  probability  of 
obtaining  a  discriminant  score  given  a  certain 
group  membership,  can  now  be  combined  using 
Bayes'  rule;  this  will  help  us  to  determine  what 
we  were  interested  in  all  along  -  namely,  how 
likely membership in the various groups is, given
the available information - referred to as the
posterior probability.
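
For the two group case, the question of how far to move the cutting point has a closed-form answer under one further assumption that the text does not state: that the discriminant scores in both groups are normal with a common standard deviation $\sigma$. Writing $\mu_1 > \mu_2$ for the group means of D and $\pi_1, \pi_2$ for the prior probabilities (all symbols introduced here for illustration), assigning a case to the group with the larger posterior probability gives:

```latex
\pi_1 f_1(D) > \pi_2 f_2(D)
\iff (D - \mu_2)^2 - (D - \mu_1)^2 > 2\sigma^2 \ln\frac{\pi_2}{\pi_1}
\iff D > D^{*} = \frac{\mu_1 + \mu_2}{2} + \frac{\sigma^2}{\mu_1 - \mu_2}\,\ln\frac{\pi_2}{\pi_1}
```

With equal priors the logarithm vanishes and $D^{*}$ is the midpoint of the two means. When group 1 is the larger group ($\pi_1 > \pi_2$) the logarithm is negative, so $D^{*}$ moves toward $\mu_2$, the mean of the smaller group, in agreement with the rule stated earlier.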

To  demonstrate  the  usefulness  of  this 
procedure,  we  can  look  at  the  following  example: 
we  wish  to  determine  the  financial  measures  that 
are  useful  for  predicting  the  financial  health  of 
a  hospital.  The  research  design  is  now  detailed: 

(1)  Financial  data  was  collected  on  48  New  York 
State  hospitals  over  a  three  year  period 
1980-82. 

(2)  These  hospitals  were  selected  based  upon  a 
New  York  State  management  group's  opinion  as 
to  the  most  fiscally  sound  and  the  most 
fiscally  distressed  hospitals  in  the  state. 
Their  decisions  did  not  result  from 
consideration  of  the  proposed  explanatory 
financial  variables. 

(3) The three year average (mean) and three year
variation (standard deviation) were calculated
for each of the 72 possible explanatory
variables. The purpose of these calculations
is to obtain measures that indicate trends
early enough to do something about them, i.e.
hospital management has the opportunity to
make changes to improve the fiscal health of
the institution.

(4)  The  most  statistically  significant 
explanatory variables were identified and the
discriminant scores were calculated for each
hospital; the "jackknife" procedure was used


to  determine  the  classification  of  the  sample 
hospitals. 

At  this  stage,  we  have  an  equation  that  can  be 
used for predicting the financial classification
of  a  hospital  -  fiscally  stable  or  fiscally 
distressed. 

Additional  utility  can  be  obtained  from  this 
equation  by  considering  the  approximate 
probability  of  group  membership  in  the 
population.  In  this  New  York  State  hospital 
example,  70%  of  the  hospitals  in  the  state  are 
fiscally  distressed  and  30%  are  fiscally  sound. 
Based  upon  these  probabilities  (group  membership 
in  the  population)  and  the  additional  information 
obtained  from  the  calculated  discriminant  score 
for  each  case  and  its  known  group,  the 
probability  of  each  case  belonging  to  a 
particular  group  given  its  calculated 
discriminant  score,  P(G(i)/D),  is  obtained.  It 
is  this  probability  that  adds  to,  and 
supplements,  the  normal  discriminant  analysis 
results. 
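
The calculation of P(G(i)/D) can be sketched as follows. The prior probabilities 0.70 and 0.30 come from the example above; the group means and standard deviations of the discriminant score in the usage note are hypothetical placeholders, since the fitted values are not reported here, and the function names are illustrative.

```python
import math


def normal_pdf(d, mean, sd):
    """P(D/G(i)): normal density of a discriminant score within a group."""
    z = (d - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def posterior_probabilities(d, priors, means, sds):
    """Bayes' rule: combine priors with the conditional densities of the
    score d and return P(G(i)/D) for every group."""
    joint = [p * normal_pdf(d, m, s) for p, m, s in zip(priors, means, sds)]
    total = sum(joint)
    return [j / total for j in joint]
```

For instance, `posterior_probabilities(0.0, [0.70, 0.30], [-1.0, 1.0], [1.0, 1.0])` returns the priors unchanged, because a score midway between two equally dispersed groups carries no information; scores closer to one group mean shift the posterior toward that group, which is what conveys the severity of a hospital's condition.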

In  summary,  this  additional  information  will 
help  a  hospital  administrator  determine  not  only 
whether  they  are  in  a  fiscally  unsound  condition, 
but  also  the  severity  of  that  condition. 
UNBIASED ESTIMATES OF MULTIVARIATE GENERAL MOMENT FUNCTIONS
OF  THE  POPULATION  AND  APPLICATION  TO  SAMPLING 
WITHOUT  REPLACEMENT  FROM  A  FINITE  POPULATION 


Nabih  N.  Mikhail, 

Abstract

Unbiased estimates of the multivariate general moment functions of
the population are obtained when sampling from finite populations.
Partitions and power sums are featured. Unbiased estimates of
multivariate cumulants and moment functions are obtained as examples of
application.

1.  Introduction 

The general moment function of the finite population (gmfp) can be
written in terms of power sums associated with the partitions involved.
We keep the coefficients of such power sums quite general so that the
results are applicable to a variety of functions.

We treat the multivariate gmfp in this paper. Univariate results
are obtained by means of coalescing. The paper follows certain ideas
and results given in Dwyer, Mikhail and Tracy (1978).

By giving more specific values to these coefficients, we can obtain
results for multivariate cumulants (Mikhail and Malik, 1971), for
multivariate moment functions (Dwyer, 1937, p.40; 1938, p.42; Mikhail
and Malik, 1978) etc. and their unbiased estimates.

The purpose here is limited to the derivations of the functional
forms of the gmfp of various weights, and their unbiased estimates
through weight 4 are obtained. The special cases for cumulants and
moment functions are obtained as applications of the theory.

Moment functions of finite population have in common the property
that they may be expressed in terms of power sums (Dwyer, 1938, p.104).

Power sums in this paper are denoted by ( ) for the sample, and by
( )$_N$ for the finite population of size N.

The combinatorial coefficient (Dwyer and Tracy, 1964, p.1174)
associated with a partition $P = p_1^{\pi_1} \cdots p_s^{\pi_s}$ ($p_i$ distinct) of the
unipartite number $p = \sum p_i \pi_i$ is

$$c(P) = \frac{p!}{(p_1!)^{\pi_1} \cdots (p_s!)^{\pi_s}\, \pi_1! \cdots \pi_s!} \qquad (1.1)$$


For  tbe  laltlpartlte  sunber  pq...,  (Tracy  and  Diyer,  1973,  p.4),  if 
partition  V:|p^qj...|  ^  ...jp^qj  ...I  ‘  (ihere|{^  represents  a 
part  of  »  repented  ij  tines)  and  p  I  p^s^,  q  ^  I  q^s^,.... 


<(,v)  - 


Pit: 


(l.T) 


IPr’r  ' 

The  |ifp  in  tens  of  noier  anns  is  F|j|  ^  :  I  <^-2) 

ibere  there  are  r  nultlTariate  units,  D  is  any  partition  of  111.  .1  and 
Ig  is  tbe  coefficient  of  (0)|. 

Then,  if  vis  any  (aoltirariate)  partition  obtained  by  partial 
coalesciif  the  coliins  of  0,  (1.2)  becoKS  tbe  anltlsarlate  fomln 


V2- 


2  d  (-1  R  i-'l 


I 


(1.2' 


fbere  «  rj  *  rj  >  . . .  =  r,  K.’v')  is  tbe  loltisarlate  conblaatoriii 
coefficient  (1.1)  and  B  is  tbe  coefficient  of  (M. 


Tbe  total  coalescini  of  (1.2)  and  (1.2)  leads  to 

l  it  (P)  b,  (P), 

r  g  r  s 


(1.2-) 


Liberty  University 


ibere  P  is  any  partition  of  r.  ^ 

We are interested in deriving the unbiased estimates Ig of the gmfp
when sampling without replacement from a finite population of size N.

Carver functions $C_P$ are given, using different notation, by Carver
(1930, p.106) and Dwyer, Mikhail and Tracy (1978, p.14, 15).


C  ^  2  (-1)"'*  (s  -  1)!  »(P)  e. 
P  p  a 


ibere 


(1.3) 


*  -  TT 

s  ,(s) 


,  n  -  n(n  -  1)  ...  (n  -  s  ♦  1). 


(1.4) 


Another set of functions related to Carver functions $C_P$, given by
Dwyer, Mikhail and Tracy (1978, p.16), Dwyer and Tracy (1980, p.435),
and Mikhail et al (1985, p.2) follows:


ibere 


2  (-!)'■'  (s  -  1)!  t  (P)  e* 

P  ' 

.(s) 


(1.3') 


(1.4') 


It is worthwhile to mention that the formulae (1.3') and (1.4') are
designed for use in unbiased estimation problems.

2. The Analysis of Dj-Functions with more than one Subscript

lere  le  need  tbe  leneral  eipressioi  l|  (/'"ju  ^1  =  nig(ll)  (Diyer. 

I 

likbail  and  Tracy,  1978,  p.l6;  likbail  and  lalik,  1971,  p.72;  Diyer  and 
Tracy,  1980,  p.43S)  ibere  0  la  any  partition  of  tll...l  ud  tbe  parts 
of  0  are  represented  by  tbe  rois.  U^s  defines,  at  least  iipllcitly, 

Dg  as  tbe  coefficient  of  (0)  in  tbe  Ig  nine.  All  speciil  cues  of  the 

sane  leiibt  (Isobarlc)  are  then  obtained  by  coalescini.  All  tbe  0 
under  coislderition  are  those  iblcb  are  partitions  of  111...1  ud 
hence  baie  only  one  T  in  any  coluu. 

Deviates should not be used for the general formula. The deviate
formulas can then be obtained by eliminating all partitions which have
one or more rows with a single 1 element. For example,


•I'^ll’ 


Bii(Il)|»Bi,  ("ll 


01 


I 


:(Bi,c;*Bi,c;)(ii).Bioc;(J;) 
01  01  1 

^  'u'“>  ‘  “lo  'll' 


*  ,10, 


Since  I|(Bi,  (;;),)  .  B,,  C^dl)  *  C,  (- 


Slnilarly, 


01  I 


110. 


I.  B,„(lll).  il,,.  (  ).  ♦ 


,011, 


*»011  'loi'r  *191 
100  010  I 

III 


100 


101 

^AIFk'l 


I  'Mir  ■  *1  "iir‘"i  "110  'oori  "loi  'oioi 
oil  010 
In general this process leads to


*  ('“h 

loroio' 

010 


What is needed, in general, is the specific functions in terms of B's
and  C's  for  different  The  etaloation  of  D,,,  ,  is  sinple  enough 


“r 


"1U,..1 


since  the  coefficient  of  the  one  roeed  (111...1I  ten  of  the  S,  lalue 
of  I  BglOli  is  !  B||Cy  there  0  is  any  partition  of  111...1  and  a  is  the  . 

nuiber  of  rots  of  0.  Thus, 

“In  =  ®ni“I  *  ®no“I  *  “101“!  ^  “011“!  ^  “loo'I  ■ 

001  OiO  100  010 

001 


and all the partitions of 111 with their appropriate B's and C's are
included.

The evaluation of Dg where 0 consists of two or more rows is
somewhat more complex. We note first that columns can be interchanged
in the problem and in the result, so we may take the units in the first
row in the first columns followed by the units in the second row in the

second  coluins,  etc.  Thus,  if  te  knot  the  value  of  and  the 

no 

coefficient  of  guj  ,  te  can  interchange  coluin  3  lith  coluin  1  and  get 
ol,,  and  the  coefficient  .  llso,  te  can  get  !>!,,,  and 


"on 

100 


nor  "1011 
10  0100 


“oni  “Iiio’  “loio  ““  “looi  “iioo' 


1000 


0001  0101 


0110 


0011 


obtained  froi  d|,,  d|,;,  etc  are  oj,,,  D|mj, 

,  ,  001  0001  0011  00001 

“lllOO’  “llOOO' 

00011  00110 
00001 

3. Unbiased Estimates of the Multivariate gmfp

In this section we obtain unbiased estimates of the multivariate
gmfp when sampling without replacement from a finite population, as


ll^ni  1  '  BsUg  the  results  of^Dij-fnictionB  in  section  2,  te 

can derive unbiased estimated value of any population moment function
when sampling without replacement from a finite population.

The formulae become quite compact if we use the integer notation
suggested by Professor Dwyer (personal communications) for partitions.
In this notation, the product of power sums (110) (101) (011) is
written as (12, 13, 23), where the numbers 1, 2, 3 indicate the rows
of the array with rows 110, 101, 011, and commas separate the columns.
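As a small illustration of the integer notation, the following Python sketch converts a partition written as rows of 0s and 1s into one label per column, listing the rows in which that column has a unit. The function name integer_notation and the string-based input format are assumptions made for this example only; they do not come from the paper.

```python
def integer_notation(rows):
    """Convert rows of '0'/'1' strings into integer notation.

    Each column is labelled by the (1-based) numbers of the rows
    that contain a 1 in that column.
    """
    n_cols = len(rows[0])
    labels = []
    for c in range(n_cols):
        label = "".join(str(r + 1) for r, row in enumerate(rows) if row[c] == "1")
        labels.append(label)
    return labels


# The product of power sums (110)(101)(011) from the text.
print(integer_notation(["110", "101", "011"]))  # ['12', '13', '23']

# A partition of 111 with one unit per column, rows 110 and 001.
print(integer_notation(["110", "001"]))  # ['1', '1', '2']
```

The second call shows how a partition of 111 with a single unit in each column reduces to a string such as 112, which is the form used as the argument of the D-functions.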


P,(fl) 


I  IF  ) 

I  11 


^  Dj  (1) 


:  D  (1,1)  ♦  D  (1,2) 
1,1  1,2 


Where the number of rows in the right is reduced to 2 with the
replacement of 1 by the one in the second column by the number of the
rows, we continue with the replacement of each 1 by the number of row.
Hence, we get

bJ(Fi1i)  ^  ♦  bJ, 21112)  +  BjjjllZl)  ♦  bIijIHZ)  * 

The sum of the three middle terms can be written as

“  s  “  I  “  s 

I  l)jjj(112)  (or  I  Djjj(121)  or  I  I)2iil211)  )  . 

where the summation applies to all the different partitions resulting from
interchanging columns. Similarly,

4  3  6 

ijlfllll)  --  bJju  dill)  ♦  juliijllin)  V  2I)Ii22dl22)  ♦  JDlijjddO) 

♦  dL,(  1234) 


"1231' 


1 


there  I  Djjj2dll2)  indicate  the  four  different  tens  resulting  froi 
the  unipartitioD  of  1112  and  the  interchange  of  coluin  1  lith  3,  2,  i, 

3  , 

respectively;  there  I  D^22dl22)  indicate  the  three  different  tens 

resulting  froi  the  partition  (1122)  and  the  tio  different  interchanges, 
etc. 
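The number of terms in each summation can be checked by brute force. The sketch below (Python; the helper names set_partitions and type_counts are hypothetical and not taken from the paper) enumerates every way of grouping the columns into rows and tallies the groupings by the sizes of their blocks. It is offered only as a check on the counts quoted in the text.

```python
from collections import Counter


def set_partitions(items):
    """Yield every partition of the list `items` into non-empty blocks."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for part in set_partitions(rest):
        # Put `first` into each existing block in turn ...
        for i in range(len(part)):
            yield part[:i] + [[first] + part[i]] + part[i + 1:]
        # ... or into a new block of its own.
        yield [[first]] + part


def type_counts(n):
    """Count set partitions of n columns by their sorted block sizes."""
    counts = Counter()
    for part in set_partitions(list(range(n))):
        key = tuple(sorted((len(block) for block in part), reverse=True))
        counts[key] += 1
    return dict(counts)


print(type_counts(3))
print(type_counts(4))
```

For three columns the block-size pattern (2, 1) occurs 3 times, matching the three middle terms; for four columns the patterns (3, 1), (2, 2) and (2, 1, 1) occur 4, 3 and 6 times respectively.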

The partitions featured and the number of terms in each summation
are available with complete coalescing of the partitions. This is
illustrated by the multinomial theorem in partition notation with the
partitions written in columns:

!) 

vl/ 

Here the multipartition reveals the numbers of unit terms in the
respective rows, and the coefficient shows the number of different 0
which coalesce to the same partition. That number is also the
combinatorial coefficient, (1.1).


(1)^  :  (1)  *  4(|j  ♦ 


4. Expressions for D


IL 


For  order  2 

»  -  . 
11 

t 


D 


n'l 


'li'! 


"12 


B..C 


12"11 


For  order  3 

1  .  » 

t 

1 

“ill 

^  “  )!i  1*  ' 

“  1?2  { 

t 

.  -1 

t 

“lI2 

t 

'  “ll2“ll 

•  “l23“21 

“123 

'  “l23“lll 

1S3 


S25 


for  order  4 


"im  ■  °ini 


“ini’ll  *  ^  ^  *1122^  *  ^  *1123^  *  ®1234''4 

®1112‘'ll  *  *1123^1  *  ®1213‘'21  *  ®2113‘'21  *  *1234^1 


aC.  ,  +  B, , aaCa.  ♦ 


C»n  + 


“ll22  ■  “n22''ll  ”1123''21  '’23ir21  "12302 

‘’1123  '  *1123^11  *  ®1234'^2U 


for  order  1 


for  order  2 

.»  1 


^  C*  -  L  c‘ 
I  "^1  ,2  2 


for  order  3 


"1234  ■  "I234''lin 
for  order  5 

”11111  '  ®lllll”l  * 
15 


”llll2”2  *  ^  ”lll22”2  *  ^  ”lll23”3 


t  t  * 

*  ^  ®11223S  *  ^  ®11234”4  *  ”l2346’'6 


”11112  '  ”llll2”ll  *  “lins'n  *  ”ll213”21  *  ”l2113”21  *  ”21U3”21 

*  ”ll223”21  ^  ”l2123”21  *  ®21123”21  *  ”ll232”21 
’  ”ll234Sl  *  ”l2134”31  *  ®21314”31  *  ”23114”31 

*  ”l2314‘'31  *  ”l2345‘'41 

”11122  '  ®11122”!i  *  ”ll322”21  *  ”l3122”21  *  ”31122”21  '  ”ll234”22 

*  ”l2134S2  *  “21134S2  *  ”l3422”31  *  ”l2345”3 

2 

”U123  ■■  *11123^111  '  ®11234”211  '  ”l2l34”211  '  ”21134”21  '  ”l2345”311 
”11223  '  ”ll223‘'lll  *  ”ll234”211  *  ®23114''211  *  ”l2345‘'221 
”11234  ■■  ”ll234”llll  *  ”l2345''2111 

”12345  '  ”l2345”lllll 
etc. 

5.  for  finite  fereion  of  Boient  functione  ^ 

As an application of unbiased estimates of gmfp, the unbiased
estimates of the finite version of the moment functions for
sampling without replacement for a finite population (Mikhail and
Malik, 1978) are considered in this section.

Here  the  coefficients  B.  for  0  :  111...1  is 

I  " 

=1  if  D  :  lll.,,l  in  one  ron 


0  :  111. ..10000 
000... 01000 


000... 00001 


|-U“''ls-1) 


othernise 


:  10. ..0 
010. .0 


>itb  111  .1 
in  the  first 
roi  and  onlp 
one  T  in 
each  ro«  of 
the  (s-l)  roes 


for  order  4 


D*  -  C*  -  C* 
”1123  ■  ,3  ”1  ,4  ''2 

'  1  ’  1 


.  -1  » 


for  order  5 

.»  .  1  .»  ,  Jl  »»  ,  in  p»  .  -i  p»  ,  -1  r* 

”11111  ■  I  ”1  ■  ”  ,2  ”2  *  ,3  ”3  ,4  ”4  ,5  ”5 

„»  .  il  .»  4  »»  ,  J  r‘  .  JL  p‘ 

”11112  ',2  ”1  ,3  ”2  ,4  ”3  ,5  ”4 

'  1  '  1  '  1  '  1 

.»  .  _I  -»  J  c*  -  C*  a  C‘ 

•imi  ■  ,1  ‘1  h  ,s  '1 


.«  .  J  p»  .  J  ,  -1  c‘ 

”11123  ■  ,3  "^l  ,4  *^2  ,5  '■I 

'  1  *  1  '  1 


C*  *  -*  C* 

4  ”2  ,5  ”2 

'  1  '  2 


Then  for  the  partition  0  and  u  in  C  nritten  a  coluin,  »e  hare 

u 
D*  -  c‘

"12345  ■  ,5  "l 

"  1 

1 

1 

1 


6.  Dj  tor  ?iaite  Veraion  of  CuittUats  <  m  j 

Unbiased estimates of finite version of the cumulants are considered
here as another application of unbiased estimates of the gmfp for
sampling without replacement for finite population (Mikhail and Malik,
1978).

Here  the  coefficient  Bg  for  0  ;  111...1  ia 

Bg  :  (-1)*  *(a  -  1)!/h’  if  the  nuiber  of  roaa  ia  i. 

Since all moment functions are identical for order 1, 2, and 3, we can
say that unbiased estimates of moment functions of the population are
identical for order p ≤ 3. Here we start for cumulants of order p ≥ 4.

for  order  4 

"nil  ■  H  n  g2  h  g2  ‘'2  g3  1  ,4  M 


d‘  -  ^  c* 

12345  ■  ,5  1 

"  1 
1 
I 
1 

7. Applications and Summary

In  thin  aection  eiaiples  of  the  unhiaaed  eatiaatea  of  the  triaariate 

I  /^  \  t 

loient  function  Bg  a  ^1  j  and  cuiulanta  Bg< 
plicationa  to  the  unhiaaed  eatiaatea  of  the  lultirariate  gifp 


are  obtained  ae  ap- 


■H(il 


[in  aectiona  5  and  6  uaing  the  Dg-functiona  in  aectiona  2  and 


3.  For  eianple, 
'1' 
for  a|  1 


c;)m(112)M121) 


_i  n*  Jl 

'=2 

I  1  s  1 
.4  ''3


g»  .  li  ,  _1  g»  .  d 

"i122  ■  ,2  n  „3  ''2  ,4  ^2 

'  1  '  1  '  2 


0*  ■  c* 

"ll23  ■  ,3  n 


c* 

1  ’  1 


"1234 


1 


c‘ 

4  n 
"  1 

1 


i 


for  order  5 

.  1  g»  .  ^  gMl  p»  ,  21  g‘  ,  11 

"lllir  H  n  ,2  ^2  ,2  ‘'2  j3  h  g3  ’'3 

g»  .  ,  ^  g»  ,  12  J  11  ,  2i 

‘’llin  ■  ,2  n  ,3  *^2  ,3  '^2  ,4  ‘'3  ,5  S 

'  1  ’  1  '  I  ’  1  ’  1 

„»  .  .  J.  P«  ,  J  ,  11  Ji  g»  ,  21  g‘ 

11122'  ,2  n  ,3  ^2  ,4  ''2  ,4  ''3  ,5  ''3 

"  1  ’  1  "  2  "  1  2 


^  cS  ^  c’ 


«“"(j  '1'^  'i)  '1'“'  .1  'i 


for  I  I 


'Hi) 


(111)  (  J  C*  -  ^  C*  ♦  ^  c’jt  (  (112)  ♦  (121)  * 


c!  t  ^  ♦  c;  I  4  (123)  -!r  c 


(211)  )  (  ,2  "1  \3 


-I  c* 

.3  H 


For the trivariate moment function we have


*»'^ll’ '  ^11*"'* '  '’iio^oi'*  '’loi^io**  ’’oi/ioo^*  ^ 


100, 


001  010 

For the bivariate case we get

BglFfi)  -■  Djj(21)  t  t^glgj)  ♦  ..jjl  jg  , 


100 


10 


010 

001 


oor 


i;(F„):D:,(21)aI);g(j;)a2D;j(;j)»D;g(!0)  , 
01  10  10 


01 


and for the univariate case we have


g»  .  -2  p«  .  12  p\  21  p  • 
'  1  '  1  "  1
D*  -  C*
’  1
^  C*  a  ^  C  ‘
B;(F3).D;(3)a3D;(|)aD;

1  1 
1 

In general this paper gives the functional forms of the multivariate
gmfp and its unbiased estimates in a very compact form. The results
are applied to obtain multivariate unbiased estimates of moment
functiona  I  and  cuiulanta 
Abstract

The  two-terminal  reliability  problem  for  an  undirected  network 
involves  calculating  the  probability  that  two  distinguished  sites 
are  connected  by  a  path  of  working  edges.  This  problem  is 
known  to  be  NP-hard,  even  for  the  special  case  of  planar 
systems.  We  describe  efficient  data  structures  and  algorithms 
for  calculating  the  two-terminal  reliability  for  planar  networks 
in  pseudopolynomial  time;  that  is,  the  time  complexity  is 
polynomially  bounded  in  the  number  of  paths  (or  the  number 
of  cutsets).  Computational  experience  with  the  algorithms  is 
also  presented. 

1.  Introduction 

The  study  of  the  reliability  of  complex  systems  has 
interested  mathematicians,  statisticians,  electrical  engineers, 
and  computer  scientists  among  others  (Barlow  and  Proschan, 
1975).  Its  applications  include  such  diverse  areas  as 
communication  and  transportation  systems,  electrical 
networks,  quality  control,  computer  design,  and  software 
validation.  In  particular,  network  reliability  addresses  the 
synthesis  and  analysis  of  systems  that  can  be  modeled  using  a 
network of vertices and edges. The availability of either two-way
or one-way communication is reflected through the use of
undirected  or  directed  networks. 

The  models  used  in  network  reliability  are  of  two  types: 
deterministic  and  stochastic.  In  deterministic  network  models, 
a  fixed  network  is  subject  to  attack  by  an  intelligent  adversary. 
Typical  reliability  measures  used  are  the  connectivity, 
cohesiveness,  and  diameter  of  the  underlying  graph.  Such 
measures tend to incorporate a worst-case point of view, by
concentrating  on  the  maximum  disruption  that  could  be 
inflicted  on  the  system. 

In  stochastic  network  reliability  (the  focus  of  study  here), 
the  network  components  are  subject  to  failure  according  to 
some  probability  model.  Typically,  the  system  under 
consideration  is  treated  as  a  network  with  reliable  (or  perfect) 
vertices  and  unreliable  edges  that  may  assume  one  of  two 
states:  operational  or  failed.  The  edges  are  assumed  to  fail 
independently,  with  probabilities  that  are  known  and  constant 
over  time. 

The  average  behavior  of  such  a  system  can  be  studied  using 
a  variety  of  probabilistic  measures  that  quantify  the 
“connectedness”  of  the  underlying  graph  resulting  from  edge 
failures. In undirected networks G, the all-terminal reliability
R(G) refers to the probability that all vertices remain
connected. The two-terminal reliability R_st(G) is the
probability that two specified vertices s and t are connected
using operational edges of the graph. An alternative measure
is the expected number of vertex pairs able to communicate.
This paper will be primarily concerned with the two-terminal
reliability of a network in which each edge i is assumed to
operate with probability p_i.

The most fundamental method of calculating R_st(G) uses
state-space enumeration and dates back to Moore and Shannon
(1956). The state of the network can be represented with a 0-1
vector δ = [δ_1, δ_2, ..., δ_m] whose i-th component is 1 if edge i
is operating and 0 otherwise. The probability of a given state
δ is then given by

    \Pr(\delta) = \prod_{i=1}^{m} p_i^{\delta_i} (1 - p_i)^{1 - \delta_i} .

Suppose Ω is the set of all states and the variable I_st(δ) equals
1 if the subgraph of operational edges indicated by δ contains
an s-t path and 0 otherwise. An s-t path is simply a minimal
set of edges whose functioning ensures that s and t are
connected. Then the s-t reliability is given by

    R_{st}(G) = \sum_{\delta \in \Omega} I_{st}(\delta) \Pr(\delta) .
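The state-space formula can be written out directly for a very small network. The following Python sketch is only an illustration under stated assumptions: the five-edge "bridge" network, the vertex names and the function names are hypothetical choices for this example and do not appear in the paper.

```python
from itertools import product


def st_connected(edges, state, s, t):
    """Return True if s and t are joined by operating edges (state[i] == 1)."""
    adj = {}
    for (u, v), up in zip(edges, state):
        if up:
            adj.setdefault(u, []).append(v)
            adj.setdefault(v, []).append(u)
    seen, stack = {s}, [s]
    while stack:
        u = stack.pop()
        if u == t:
            return True
        for w in adj.get(u, []):
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return False


def reliability_by_enumeration(edges, p, s, t):
    """Sum Pr(state) over all 2^m states in which s and t are connected."""
    total = 0.0
    for state in product((0, 1), repeat=len(edges)):
        pr = 1.0
        for up, p_i in zip(state, p):
            pr *= p_i if up else (1.0 - p_i)
        if st_connected(edges, state, s, t):
            total += pr
    return total


# Hypothetical example: the classical bridge network with all p_i = 0.5.
bridge = [("s", "a"), ("s", "b"), ("a", "b"), ("a", "t"), ("b", "t")]
print(reliability_by_enumeration(bridge, [0.5] * 5, "s", "t"))  # 0.5
```

The loop visits every one of the 2^m states, so the sketch is usable only for networks with a handful of edges.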

Although conceptually simple, the state-space approach is
impractical because |Ω| = 2^m. Improvements to this approach
can be made by focusing directly on the s-t paths {P_1, P_2, ...,
P_k} of G. Define E_i to be the event that all edges in path P_i
operate. Then

    R_{st}(G) = \Pr[E_1 \cup E_2 \cup \cdots \cup E_k] .    (1.1)

The probability of each event E_i is easy to calculate by the
independence assumption:

    \Pr[E_i] = \prod_{j \in P_i} p_j .

The  evaluation  of  (1.1),  however,  typically  requires  complex 
calculations  because  the  events  are  not,  in  general,  mutually 
disjoint.  For  example,  this  equation  can  be  expanded  using 
the inclusion-exclusion formula, but there are an exponential
number of terms (2^k - 1) to be considered. Thus this method of
calculating R_st(G) is exponential in k. This exponential
behavior is not surprising in view of the fact that calculation of
R_st(G) is a mathematically difficult problem. Namely, this
problem  belongs  to  the  class  of  NP-hard  problems  (Rosenthal, 
1977)  and  thus  there  is  unlikely  to  exist  any  efficient  (i.e. 
polynomial-time)  solution  procedure. 
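The inclusion-exclusion expansion of (1.1) can be sketched in a few lines of Python. This is a minimal illustration, assuming that each path is supplied as a set of edge indices and that edges fail independently; the function name and the bridge-network example are hypothetical and not from the paper.

```python
from itertools import combinations
from math import prod


def reliability_by_inclusion_exclusion(paths, p):
    """Expand Pr[E_1 or ... or E_k] by inclusion-exclusion.

    paths: list of sets of edge indices (one set per s-t path).
    p:     list of edge operating probabilities, indexed by edge.
    """
    total = 0.0
    for r in range(1, len(paths) + 1):
        sign = 1.0 if r % 2 == 1 else -1.0
        for subset in combinations(paths, r):
            # All paths in the subset operate iff every edge in their union operates.
            union = set().union(*subset)
            total += sign * prod(p[j] for j in union)
    return total


# Hypothetical bridge network: edges 0=(s,a), 1=(s,b), 2=(a,b), 3=(a,t), 4=(b,t).
bridge_paths = [{0, 3}, {1, 4}, {0, 2, 4}, {1, 2, 3}]
print(reliability_by_inclusion_exclusion(bridge_paths, [0.5] * 5))  # 0.5
```

With k paths the double loop visits all 2^k - 1 non-empty subsets, which is the exponential growth in k noted in the text.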

Here we address the s-t reliability problem for the special,
but important, case of planar graphs. It is known that
calculating R_st(G) is still an NP-hard problem for planar
graphs (Provan, 1986). Our concern then is in providing for
efficient enumeration of certain combinatorial objects (s-t paths
and s-t cutsets) in planar graphs, which can then be used to
calculate the s-t reliability in pseudopolynomial time: i.e. the
work  involved  is  polynomially  bounded  in  the  number  of  such 
objects,  although  it  still  can  grow  exponentially  with  the  size 
of  the  graph.  Throughout,  various  discrete  structures  will  be 
developed  both  as  data  structures  and  as  theoretical 
frameworks  to  implement  this  approach.  Section  2  discusses  a 
compact  representation  for  planar  graphs,  and  the  next  two 
sections  present  efficient  algorithms  for  generating  s-t  cutsets 
and  s-t  paths  in  planar  graphs.  Section  5  shows  how  these 
generated  objects  can  then  be  used  to  calculate  network 
reliability  for  such  networks. 

2. Representation of Planar Graphs

An undirected graph G = (V, E) consists of a finite set V of
vertices and a set E of edges whose elements are unordered
pairs of vertices. The edge e = (u,v) ∈ E is said to be incident
with u and v, and the vertices u and v are the end points of e.
Two vertices u and v for which (u,v) ∈ E are called adjacent.
The set of vertices adjacent to v is written A(v), with the
degree of v defined as |A(v)|. Throughout, we will reserve n
for |V| and m for |E|.

In particular, we will be concerned with planar graphs. An
undirected graph is planar if it can be embedded in the plane so
that edges intersect only at a vertex with which they are both
incident. Given an embedding of G in the plane, a region of G
is a maximal connected portion of the plane which does not
contain  elements  of  G.  Every  embedding  of  G  in  the  plane  has 
one  infinite  region  called  the  exterior  region.  A  planar  graph 
will  in  general  have  many  different  plane  embeddings,  though 
the  total  number  of  regions  rQ  will  always  equal  m  -  n  +  2. 
Figure  2.1  provides  an  example  of  a  graph  and  one  particular 
embedding of it in the plane; the regions r_i have also been
indicated with the exterior region called r_1.


Figure  2.1  A  planar  graph  with  regions  rj  through  rg 

Let G be a planar graph with a fixed plane embedding. A
dual of G, denoted G^d, is a graph formed by associating a
vertex of G^d with each region of G and then joining two
vertices of G^d for each edge of G common to the boundaries
of the two associated regions of G. (See Figure 2.2.) When it
is clear from the context, a dual relative to a specific
embedding of G will be referred to as the dual of G. The dual
graph G^d is also a planar graph. Moreover, the regions of G
are in one-to-one correspondence with the vertices of G^d,
vertices of G with the regions of G^d, and edges of G with the
edges of G^d. Note that an embedding of G^d is determined by
the chosen embedding of G.


Figure 2.2 A graph G and its dual G^d

Subsequently,  it  will  be  necessary  to  identify  the  regions 
and  edges  “around”  a  given  vertex  v,  relative  to  a  fixed 
embedding  of  G.  A  region  is  incident  with  v  whenever  v  is  on 
the  boundary  of  that  region.  The  regions  and  edges  incident 
with v can be placed in an ordered (circular) list: r_0, e_0, r_1,
e_1, ..., r_{d-1}, e_{d-1}, r_0, where d is the degree of v and region
r_i is bordered by e_{i-1} and e_i (all subscripts being taken modulo d).
Such an ordered list will reflect either a clockwise (CW) or
counterclockwise (CCW) traversal of the regions and edges
incident with v. Similarly, a CW or CCW orientation can be
given to a region r, inducing an ordered circular list of the
vertices and edges on its boundary: v_0, e_0, v_1, e_1, ..., v_{k-1},
e_{k-1}, v_0. Here edge e_{i-1} joins vertices v_{i-1} and v_i (modulo k),
with k denoting the size of the boundary of r
(equivalently the degree of the dual vertex corresponding to r
in the dual embedding).

Planar  graphs  can  be  encoded  in  a  compact  way  for  use  in 
reliability  analysis  (Whited,  1986).  In  particular,  the  data 
structure  allows  easy  access  to  (a)  an  ordered  list  of  the  edges 
and regions incident with a given vertex v; (b) an ordered list
of  the  edges  and  vertices  on  the  boundary  of  a  given  region  r; 
(c)  the  two  regions  bordered  by  each  edge;  and  (d)  the 
corresponding representation for the dual graph. Linear time
algorithms to carry out each of these four tasks can be
implemented using such a data structure.
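One common way to hold the ordered lists around each vertex is a rotation system, from which the regions can be traced. The sketch below is not the data structure of Whited (1986), whose details are not given here; it is a minimal Python illustration that assumes a straight-line drawing with known coordinates, and all names in it are hypothetical.

```python
from math import atan2


def rotation_system(coords, edges):
    """For each vertex, list its neighbours in counterclockwise order."""
    nbrs = {v: [] for v in coords}
    for u, v in edges:
        nbrs[u].append(v)
        nbrs[v].append(u)
    rot = {}
    for v, ns in nbrs.items():
        x0, y0 = coords[v]
        rot[v] = sorted(ns, key=lambda w: atan2(coords[w][1] - y0, coords[w][0] - x0))
    return rot


def trace_regions(rot):
    """Trace each region as a closed walk of directed edges (darts)."""
    darts = {(u, v) for u in rot for v in rot[u]}
    regions = []
    while darts:
        start = next(iter(darts))
        walk, d = [], start
        while True:
            darts.discard(d)
            walk.append(d)
            u, v = d
            ring = rot[v]
            # Leave v along the neighbour that follows u in the rotation at v.
            d = (v, ring[(ring.index(u) + 1) % len(ring)])
            if d == start:
                break
        regions.append(walk)
    return regions


# Hypothetical example: K4 drawn as a triangle with one interior vertex.
coords = {0: (0.0, 0.0), 1: (2.0, 0.0), 2: (1.0, 2.0), 3: (1.0, 0.7)}
edges = [(0, 1), (1, 2), (2, 0), (0, 3), (1, 3), (2, 3)]
regions = trace_regions(rotation_system(coords, edges))
print(len(regions), len(edges) - len(coords) + 2)
```

For this drawing the number of traced regions agrees with m - n + 2, and each edge appears on exactly two region boundaries (once per direction), which is the information needed for task (c).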

3. Enumeration of s-t Cutsets in Planar Graphs

A  fundamental  notion  in  reliability  calculations  is  that  of  a 
path:  a  minimal  set  of  components  whose  operation  ensures 
that  the  system  operates.  Another  important  concept  is  that  of 
a  cutset:  a  minimal  set  of  components  whose  failure  ensures 
that  the  system  must  fail.  We  first  describe  such  concepts  in 
the  context  of  graphs,  and  then  discuss  methods  for 
enumerating  these  objects.  For  planar  graphs,  certain  “local” 
information  can  be  exploited  to  provide  an  improved  cutset 
enumeration  algorithm.  Section  4  presents  similar  methods  for 
the  enumeration  of  paths.  First,  we  establish  some  needed 
notation. 

In a graph G = (V, E), the complement of X ⊆ V is denoted
by X̄ = V - X. The open neighborhood Γ(X) of X is defined
by Γ(X) = {v ∈ X̄ | (u,v) ∈ E for some u ∈ X}. The induced
subgraph ⟨X⟩ is the graph H = (X, F) where F = {(u,v) ∈ E |
u, v ∈ X}.

An alternating sequence u = v_0, (v_0,v_1), v_1, ..., v_{k-1},
(v_{k-1},v_k), v_k = v of distinct vertices and edges is called a u-v
path. If a u-v path exists in G between all vertices u and v,
then G is connected. Otherwise, G decomposes into a number
of connected components. A vertex v of a connected graph is
called a cut vertex if the graph G - v = ⟨V - v⟩ is not connected.
A  minimal  set  of  edges  whose  removal  from  G  leaves  s  and  t 
in  different  connected  components  is  an  s-t  cutset. 

If X ⊂ V with s ∈ X and t ∈ X̄, then (X, X̄) denotes the set
of edges in E with one end point in X and the other in X̄.
Note that the removal of the edges in (X, X̄) separates vertex s
from vertex t. If both induced subgraphs ⟨X⟩ and ⟨X̄⟩ are
connected, then it is known that (X, X̄) is an s-t cutset
(Bellmore and Jensen, 1970). For this reason, such a set X
(with ⟨X⟩ and ⟨X̄⟩ being connected) will be called a connected
s-t separating set.

The most efficient algorithm for enumerating all s-t cutsets
in an undirected graph G is the procedure of Tsukiyama, et al.
(1980). Its worst-case time complexity is given by
O((n+m)c_st), where c_st is the number of s-t cutsets in G. This
algorithm relies on two established facts:

(1) There is a one-to-one correspondence between s-t
cutsets and connected s-t separating sets.

(2) Let X ⊂ Y ⊂ V. If both (X, X̄) and (Y, Ȳ) are s-t
cutsets, then there exists a v ∈ Y - X so that
(X+v, X̄-v) is an s-t cutset. Such a set X+v is called a
1-point extension of X.

In view of (1), it is only necessary to enumerate connected s-t
separating sets. The second fact then guarantees that all
separating sets can be generated by considering only 1-point
extensions of separating sets. This leads to the algorithm of
Tsukiyama, in which each separating set is recursively
processed to find its 1-point extensions.

(Any edge not on some s-t path is termed irrelevant. The
presence of irrelevant edges may invalidate the fact that only
1-point extensions are required to find all connected s-t
separating sets. For this reason, it will be supposed
throughout  that  G  is  a  graph  without  irrelevant  edges.  This 
condition  can  be  efficiently  checked,  using  an  algorithm  of 
Hopcroft  and  Tarjan  (1973).) 

To see how the processing works, let X be a connected s-t
separating set and let v ∈ X̄. The conditions for X+v to be a
connected s-t separating set are then:

(a) v ≠ t,

(b) ⟨X+v⟩ must be connected (so v ∈ Γ(X)),

(c) ⟨X̄-v⟩ must also be connected (so v cannot be a cut
vertex of ⟨X̄⟩).

The crucial step in the Tsukiyama algorithm is determining the
set W(X) comprised of all vertices v for which X+v is a
1-point extension of X. The three conditions just stated give the
following description:

    W(X) = {v ∈ X̄ | v ≠ t, v ∈ Γ(X), and v ∉ K(X)},

where K(X) is the set of all cut vertices of ⟨X̄⟩.

Table 3.1 Comparison of Algorithms on Four Test Examples

It is important to note that determining K(X), and thus
W(X), for each connected s-t separating set X is the most
time-consuming aspect of the algorithm. A goal of this section is to
develop an efficient way to determine whether or not a given
vertex v ∈ X̄ is in K(X) for the case when G is a planar
graph.
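To make the role of W(X) concrete, the following Python sketch grows connected s-t separating sets by 1-point extensions, testing the cut-vertex condition with a global connectivity search. It is a simplified illustration and not the procedure of Tsukiyama et al.: it avoids repeated work with a plain set of visited separating sets, and it assumes a graph without irrelevant edges so that {s} is itself a valid starting set. All function names and the bridge-network example are hypothetical.

```python
def is_connected(vertices, adj):
    """Return True if the subgraph induced by `vertices` is connected."""
    vertices = set(vertices)
    if not vertices:
        return False
    start = next(iter(vertices))
    seen, stack = {start}, [start]
    while stack:
        u = stack.pop()
        for w in adj[u]:
            if w in vertices and w not in seen:
                seen.add(w)
                stack.append(w)
    return seen == vertices


def one_point_extensions(X, adj, t):
    """W(X): vertices v outside X with v != t, v adjacent to X,
    and v not a cut vertex of the induced complement."""
    comp = set(adj) - set(X)
    W = set()
    for v in comp:
        if v == t or not any(u in X for u in adj[v]):
            continue
        if is_connected(comp - {v}, adj):  # global check of condition (c)
            W.add(v)
    return W


def connected_separating_sets(adj, s, t):
    """Collect every set reachable from {s} by repeated 1-point extensions."""
    found, stack = set(), [frozenset([s])]
    while stack:
        X = stack.pop()
        if X in found:
            continue
        found.add(X)
        for v in one_point_extensions(X, adj, t):
            stack.append(X | {v})
    return found


# Hypothetical bridge network.
adj = {"s": {"a", "b"}, "a": {"s", "b", "t"}, "b": {"s", "a", "t"}, "t": {"a", "b"}}
print(len(connected_separating_sets(adj, "s", "t")))  # 4
```

Each call to one_point_extensions searches the whole complement for every candidate vertex, which is the global cost that a local test around v is meant to avoid in the planar case.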

Suppose then that X is a connected s-t separating set of the
planar graph G, and let v be an element of Γ(X) ⊆ X̄.
Normally, determining whether v ∈ K(X) is a “global”
operation in the sense that all of ⟨X̄⟩ must be examined (e.g.
by a depth-first search) to decide whether or not v is a cut
vertex. However, for planar graphs a “local” check suffices,
involving only the relationships among the edges and regions
surrounding v. We will use C_X to denote (X, X̄) and R_X to
denote those regions of G which have some edge of C_X on
their boundary. Also, as in Section 2, the ordered circular list
of regions and edges around v will be denoted by A_v: r_0, e_0,
r_1, e_1, ..., r_{d-1}, e_{d-1}, r_0. Knowledge of which edges and
regions of A_v are in C_X and R_X, respectively, will provide the
“local” check indicating whether or not v is a cut vertex of
⟨X̄⟩.

Note that v ∈ Γ(X) implies that at least one edge of A_v is in
C_X; the connectivity of ⟨X̄⟩ implies that at least one edge of A_v
is not in C_X. Denote by I_v(X) the subset of A_v containing the
edges and regions in C_X and R_X. We define I_v(X) to be a
contiguous subset of A_v if for some j and k (modulo d)

    I_v(X) = {r_j, e_j, r_{j+1}, e_{j+1}, ..., r_k, e_k, r_{k+1}}.

Examples of contiguous and non-contiguous subsets I_v(X) of A_v
are shown in Figure 3.1. The following result (Whited, 1986)
establishes the key relationship to W(X).

Theorem 3.1 Let G be an undirected planar graph with a
fixed plane embedding and let X be a connected s-t separating
set of G. Then v ∈ Γ(X) is not a cut vertex of ⟨X̄⟩ if and only
if I_v(X) forms a contiguous subset of A_v.
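The contiguity condition of the theorem amounts to checking that the marked entries of a circular list form a single arc. A minimal Python sketch of that check follows; it assumes the circular list around v has already been reduced to a list of booleans (True for an entry belonging to I_v(X)), and the function name is hypothetical.

```python
def is_contiguous_circular(flags):
    """Return True if the True entries of a circular list form one arc.

    A single arc has exactly one position where a True entry follows
    a False entry when the list is read cyclically.
    """
    if all(flags):
        return len(flags) > 0  # the whole circle counts as one arc
    starts = sum(1 for i in range(len(flags)) if flags[i] and not flags[i - 1])
    return starts == 1


print(is_contiguous_circular([True, True, False, False]))   # True
print(is_contiguous_circular([True, False, True, False]))   # False
print(is_contiguous_circular([True, False, False, True]))   # True (wraps around)
```

The test reads each entry once, so its cost depends only on the degree of v rather than on the size of the graph.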

The characterization in Theorem 3.1 yields a simple, local
check  to  identify  whether  or  not  v  is  in  K(X),  thus  simplifying 
the  determination  of  W(X)  for  planar  graphs.  In  addition,  the 
planarity  of  G  can  be  used  to  establish  efficient  updating 
schemes  for  various  sets  needed  to  calculate  W(X).  The 
results  of  this  section  can  be  combined  to  produce  a 
modification  of  the  Tsukiyama  procedure  which  enumerates  all 
s-t  cutsets  in  an  undirected  planar  graph.  The  modified 
algorithm  has  the  same  worst-case  complexity  as  that  of 
Tsukiyama et al. (1980), which is O(n c_st) for planar graphs.

Both  the  modified  and  original  Tsukiyama  algorithm  have 
been  empirically  tested  on  various  examples  taken  from  the 
reliability  literature  with  results  shown  in  Table  3.1  (arranged 
roughly  in  order  of  increasing  difficulty).  The  execution  times 
(on an IBM 3081-K mainframe) shown in the T1 and T2
columns represent the total time taken to find all s-t cutsets in a
given graph after the initial set-up procedures have been
executed. The time required for set-up was virtually identical
for  both  algorithms  and  for  all  four  problems,  and  amounted  to 
approximately  .02  seconds  in  each  instance.  The  results 
indicate  that  the  modified  algorithm  yields  an  improvement  of 
up  to  36%  over  the  Tsukiyama  algorithm  in  these  planar 
graphs. 

Also,  the  two  algorithms  have  been  empirically  compared 
for  (p,q)  grid  graphs,  consisting  of  p  rows  of  q  vertices 
connected in a rectangular grid. In addition, vertices s and t
are  added,  with  s  adjacent  to  the  p  vertices  in  the  first  column 
of  the  grid  and  t  adjacent  to  the  p  vertices  in  the  last  column  of 
the  grid.  A  (4,3)  grid  graph  is  pictured  in  Figure  3.2. 
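The grid graph construction is easy to reproduce. The Python sketch below builds a (p,q) grid graph with the added vertices s and t and counts its s-t cutsets by brute force, using the correspondence between cutsets and connected s-t separating sets. The function names and the vertex labelling are assumptions made for this example, and the exhaustive search is practical only for very small grids.

```python
from itertools import combinations


def grid_graph(p, q):
    """Build a (p,q) grid graph: p rows of q vertices plus terminals s and t."""
    adj = {"s": set(), "t": set()}
    for i in range(p):
        for j in range(q):
            adj[(i, j)] = set()

    def link(u, v):
        adj[u].add(v)
        adj[v].add(u)

    for i in range(p):
        for j in range(q):
            if j + 1 < q:
                link((i, j), (i, j + 1))
            if i + 1 < p:
                link((i, j), (i + 1, j))
        link("s", (i, 0))        # s is adjacent to the first column
        link("t", (i, q - 1))    # t is adjacent to the last column
    return adj


def is_connected(vertices, adj):
    """Return True if the subgraph induced by `vertices` is connected."""
    vertices = set(vertices)
    start = next(iter(vertices))
    seen, stack = {start}, [start]
    while stack:
        u = stack.pop()
        for w in adj[u]:
            if w in vertices and w not in seen:
                seen.add(w)
                stack.append(w)
    return seen == vertices


def count_st_cutsets(adj, s="s", t="t"):
    """Count sets X (s in X, t not in X) with X and its complement connected."""
    inner = [v for v in adj if v not in (s, t)]
    count = 0
    for r in range(len(inner) + 1):
        for extra in combinations(inner, r):
            X = {s, *extra}
            if is_connected(X, adj) and is_connected(set(adj) - X, adj):
                count += 1
    return count


print(count_st_cutsets(grid_graph(3, 2)))  # 29
```

The search examines every subset of the pq grid vertices, so its running time doubles with each added vertex.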


Source                   n    m    c_st     T1       T2      %Δ

Abraham (1979)           8   12      12    .0075    .0071    5.3
Locks (1979)             9   18      72    .0411    .0294   28.5
Bailey/Kulkarni (1986)  17   25    1721   1.1518    .7609   33.9
Fishman (1986)          20   30    7376   5.5945   3.5643   36.3

T1 and T2 are the execution times in seconds for the original
and modified Tsukiyama algorithms respectively;
%Δ is (T1 - T2)/T1 × 100%

Figure 3.1 Examples of contiguous and non-contiguous I_v(X)
[Panels: I_v(X) contiguous; I_v(X) not contiguous. Legend: in I_v(X); not in I_v(X); in X; in X̄.]
The grid graphs are of interest to us for several reasons.
First,  they  exhibit  a  quite  rapid  increase  in  complexity  with 
problem  size.  For  instance,  a  (3,2)  grid  graph  has  only  29  s-t 
cutsets,  a  (4,3)  grid  graph  has  426  s-t  cutsets,  and  a  (5,4)  grid 
graph  has  16,347  s-t  cutsets.  Another  reason  for  examining 
grid  graphs  is  that  the  dual  of  a  grid  graph  is  also  a  grid  graph. 
When  the  two  algorithms  were  run  on  a  variety  of  (p,q)  grid 
graphs,  the  modified  Tsukiyama  algorithm  again  consistently 
outperformed  the  original  procedure,  with  the  percent 
improvement  averaging  21%.  In  the  most  difficult  problems 
(requiring  the  generation  of  over  100,000  cutsets),  the  percent 
improvement  exceeded  40%. 

4.  Enumeration  of  s-t  Paths  in  s-t  Planar  Graphs 

This  section  describes  algorithms  for  enumerating  s-t  paths 
in  s-t  planar  graphs.  Efficient  generation  of  such  paths  is 
crucial  for  carrying  out  the  reliability  computations  of  the  next 
section.  An  s-t  planar  graph  is  one  which  can  be  embedded  in 
the  plane  with  two  specific  vertices  s  and  t  lying  on  the 
boundary  of  the  exterior  region.  Equivalently,  G  is  an  s-t 
planar  graph  if  G  together  with  the  edge  (s,t)  is  planar.  If  G  is 
an s-t planar graph, then embed the graph H = G + (s,t) in the
plane and take its dual H^d. The regions of H which lie on
either side of the edge (s,t) are identified with vertices s^d and
t^d of H^d. The graph G* = H^d - (s^d,t^d) is then called the s-t
dual of G. The important fact linking these two graphs is that
a  set  of  edges  forms  an  s-t  cutset  (path)  in  G  if  and  only  if  the 
corresponding  set  of  dual  edges  forms  an  s-t  path  (cutset)  in 
G*. 

Thus, one way to enumerate the s-t paths of G is to find its
s-t dual G* and then enumerate the s-t cutsets of G* using the
modified Tsukiyama algorithm. In fact, one can devise an
alternative approach for s-t path enumeration that works
directly on the given graph. The key idea again is to consider
only 1-point extensions of a given path, analogous to 1-point
extensions of cutsets (Section 3).

In the present case, a path P of G is associated with a set of
regions Rp, and the successors of P are determined by 1-point
extensions Rp+r of Rp. To define such a set of regions Rp
associated with an s-t path P, notice that C^P = P + (s,t) is a
cycle of H = G + (s,t). By the Jordan Curve theorem, the
regions of H are partitioned into those “inside” and those
“outside” C^P. Let a be the region of H bounded by (s,t) and
which is inside C^P. The set Rp consists of a and all other
regions of G which are inside C^P; see Figure 4.1.


Figure  4.1  Set  Rp  contains  the  regions  inside  the  cycle  P+(s,t) 

Having defined Rp, we next wish to determine all regions r
such that Rp+r also corresponds in this way to an s-t path in
G. Such a set Rp+r is called a 1-point extension of Rp and
the path determined by it is called a successor of P. The
following theorem, analogous to Theorem 3.1, describes a
simple “local” condition required for Rp+r to be a 1-point
extension of Rp. Its proof readily follows from the
identification of s-t paths of G with s-t cutsets of G*.

Theorem 4.1 Let P be an s-t path in the s-t planar graph G.
Then, Rp+r is a 1-point extension of Rp if and only if the
boundary of r which lies on P forms a nontrivial subpath of P.

Figure 4.2 shows examples of regions which satisfy this
requirement and of others which do not. Given a path P, its
set of associated regions Rp, and a region r ∉ Rp, we can then
easily test by this theorem whether or not Rp+r forms a
1-point extension of Rp. If Rp+r is a 1-point extension, then
the (new) path Q associated with Rp+r is easily derived.
Namely, suppose, as in Figure 4.3, that P' is the u-w subpath
of P lying on the boundary of r, where u precedes w on P.
Let Q' be the other u-w path on the boundary of r. Then Q is
formed from P by replacing P' with Q'. Clearly, all that is
required to obtain the vertices V_Q and edges E_Q of Q from
those of P is a walk of the boundary of r to identify P' and Q'.


Figure 4.2 Examples of regions illustrating Theorem 4.1.
r1 and r3 satisfy Theorem 4.1; r2 and r4 do not (subpath is
trivial and intersection does not form subpath, respectively)


A formal algorithm can be based on this approach, in which
some improvements analogous to those presented in Section 3
are incorporated. The resulting algorithm has a worst-case
time complexity of O(n p_st), where p_st is the number of s-t
paths in G. Rather than pursuing this approach in more detail
we turn to a different approach that has proved to be more
efficient in practice. This alternative approach to enumerating
paths uses a depth-first search (DFS) of the graph. We now
consider how this approach can be modified to generate s-t
paths P, together with their associated regions Rp, in the
presence of planarity. Although some extra work is required
to find these regions, they are useful (in fact, essential) for the
s-t reliability calculations presented later.

The use of a DFS to enumerate the s-t paths of a graph G
requires, for each vertex v of G, the set A(v) = {w ∈ V | (v,w)
∈ E} of vertices adjacent to v. Suppose that P is a current
path from s to v. We wish to extend P in all possible ways to
an s-t path. The vertices in A(v) are scanned and each
w ∈ A(v) not already on P is used, in turn, to extend P to a
longer path by adding (v,w) and w. The search then proceeds
from each of these extensions in a recursive manner. If t is
reached then a new s-t path has been found. Such a
straightforward DFS procedure usually performs well in
practice, but can be inefficient in the worst case. Read and
Tarjan (1975) give an example of a graph with m edges and
relatively few paths for which this algorithm requires on the
order of 2^m steps. Read and Tarjan modified this basic DFS
approach so that the recursion proceeds only when it will
definitely lead to a new path. In essence, this is accomplished
by looking ahead, before extending P to w ∈ A(v), to
determine whether or not this extension leads to some s-t path.
This results in an O(m p_st) algorithm for a graph with p_st paths,
which is O(n p_st) for planar graphs. Subsequently it will be
assumed that the basic DFS procedure has been modified in
this fashion.
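
The basic DFS scan described above can be sketched as follows. This is a minimal illustration only: it implements the straightforward procedure without the Read and Tarjan look-ahead, and the function name st_paths, the vertex labels, and the small four-vertex "bridge" network are assumptions made for the example, not taken from the paper.

```python
from typing import Dict, Iterator, List


def st_paths(adj: Dict[str, List[str]], s: str, t: str) -> Iterator[List[str]]:
    """Yield every simple s-t path by a basic DFS (no look-ahead)."""
    path = [s]          # current path P from s to the vertex being scanned
    on_path = {s}       # vertices already on P

    def extend(v: str) -> Iterator[List[str]]:
        if v == t:                      # reached t: a new s-t path is found
            yield list(path)
            return
        for w in adj[v]:                # scan A(v) in its stored order
            if w not in on_path:        # only vertices not already on P
                path.append(w)
                on_path.add(w)
                yield from extend(w)    # recurse from the extended path
                path.pop()
                on_path.remove(w)

    yield from extend(s)


# Hypothetical example: a bridge network on vertices s, a, b, t.
adj = {
    "s": ["a", "b"],
    "a": ["s", "b", "t"],
    "b": ["s", "a", "t"],
    "t": ["a", "b"],
}
paths = list(st_paths(adj, "s", "t"))
for p in paths:
    print(" - ".join(p))
```

The order in which paths appear depends only on the order of each adjacency list, which is why a plane-consistent ordering of A(v) matters in what follows.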

To aid in generating the region sets Rp at the same time, it
will be convenient to use a partial ordering on the set of s-t
paths. Namely, if P and Q are two s-t paths such that Rp ⊆
Rq, then P ≽ Q in the partial order. For convenience, s and t
can be thought of as being on the boundary of the exterior
region. Then the relation P ≽ Q just means that P lies
“above” Q (Kulkarni and Adlakha, 1985); see Figure 4.4.


Figure 4.4 Example of two paths P and Q with P ≽ Q;
P={2,4,6}, Rp={a,e}; Q={3,8}, Rq={a,b,c,d,e}

In carrying out the DFS, the order in which paths are
enumerated is determined by the order in which the vertices in
each of the sets A(v) are processed. For a planar graph G, it is
natural to order these sets in a manner consistent with a plane
embedding of G. Specifically, we use a CW orientation
around each vertex v to make A(v) an ordered circular list
denoted w_0, w_1, ..., w_{d-1}, w_0, where d is the number of
vertices in A(v). If v is reached from w_i in the DFS, then the
other vertices around v will be considered in the order
w_{i+1}, w_{i+2}, ..., w_{i-1} (mod d). Several results follow
from this scheme (Whited, 1986):

(1) If an s-t planar graph is embedded so that s and t lie on
the border of some region a (say, the exterior region),
then the search can be restricted each time a vertex
lying on the boundary of a is encountered.

(2) The DFS can be performed so that P is found before Q
whenever P ≽ Q.

(3) The regions Rq can be determined easily from a knowledge
of the regions Rp for each of the paths P ≽ Q.

The third result is essential to calculation of s-t reliability
and hence is presented in more detail. Given any two s-t paths
P and Q, define two other s-t paths as follows. The upper
path U(P,Q) is determined by the boundary of the set of
regions R_U(P,Q) = Rp ∩ Rq. Similarly, the lower path L(P,Q)
is determined by R_L(P,Q) = Rp ∪ Rq. As seen in Figure 4.5,
U(P,Q) is formed by choosing the first path in a CW traversal
whenever P and Q cross, and L(P,Q) by choosing the last.
Now suppose P and Q are two successive paths found by a
DFS algorithm. The next theorem (Whited, 1986) shows that
U(P,Q) and L(P,Q) can be used to find Rq.


Figure 4.5 Upper path U(P,Q) and lower path L(P,Q)


Theorem 4.3 Let P and Q be two successive s-t paths found
by a DFS (node adjacencies scanned CW) of an s-t planar
graph G. Then for some region r of G, Rp = R_L(P,Q) - r and
Rq = R_U(P,Q) + r.

The proof of this result relies on two facts, which can be
readily established. First, whenever some portion of Q lies
below P, this portion can bound only one region r. Second,
only one portion Q' of Q lies below P: if Q again meets P after
going below it, then Q will remain on or above P. Theorem
4.3 yields a convenient method for finding Rq. Namely, first
walk P and Q until they differ, identifying r as the region
bounded from below by Q'. Next, R_U(P,Q) is known since
R_U(P,Q) = Rp ∩ Rq ⊆ Rq implies U(P,Q) ≽ Q and so by our
previous observation U(P,Q) is found before Q. In this way,
Rq = R_U(P,Q) + r can be determined.

Together these three results produce an efficient DFS path
enumeration procedure for undirected s-t planar graphs which
not only finds all s-t paths, but also finds Rp for each path.
One particular implementation issue deals with locating
U(P,Q), and thus R_U(P,Q), among the paths already generated.
Since there are often many paths in even relatively small
graphs, to store all paths previously generated and search the
entire set would be expensive in both storage space and
execution time. We can show that at most r_G = m - n + 2
paths need to be stored at any one time, where r_G indicates the
number of regions of the planar graph G.

These paths P' (kept as a stack) are a subset of those paths
satisfying P' ≽ Q, each differing from the previous by a single
region in Rp'. Either the top of the stack is U(P,Q) or is
removed and will never be needed again in searching for
U(P,Q) (Whited, 1986). These results yield an O(n p_st)
algorithm for generating all paths in an s-t planar graph.

5. Pseudopolynomial Algorithms for Network Reliability

In this section, we discuss how the enumeration of s-t
cutsets and s-t paths can aid in calculating R_st(G), the
probability that s and t are connected in a graph G with
stochastically failing edges. Each edge i of G is assumed to
operate independently with probability p_i. As discussed in
Section 1, if E_j is the event that all edges in path P_j operate,
then the s-t reliability is given by

    R_{st}(G) = \Pr[E_1 \cup E_2 \cup \cdots \cup E_{p_{st}}].    (5.1)

In a similar way, if F_j is the event that all edges in cutset C_j
fail, then the s-t unreliability of G is

    U_{st}(G) = 1 - R_{st}(G) = \Pr[F_1 \cup F_2 \cup \cdots \cup F_{c_{st}}].    (5.2)

Various techniques, such as inclusion-exclusion, for
calculating R_st(G) or U_st(G) require (in the worst case) an
amount of work that is exponential in the number of objects
(paths, cutsets). We would like instead a method that is
polynomial in the number of objects.
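
To make the exponential cost concrete, the following sketch evaluates (5.1) by inclusion-exclusion over the path events. It is an illustration under stated assumptions: the function name, the edge numbering, and the five-edge bridge network with common edge reliability 0.5 are hypothetical and are not the network of Figure 5.1. The loop visits every nonempty subset of the paths, so r paths cost 2^r - 1 terms.

```python
from itertools import combinations
from math import prod


def reliability_inclusion_exclusion(paths, p):
    """Pr[E_1 ∪ ... ∪ E_r] by inclusion-exclusion.

    paths : list of sets of edge labels, one set per s-t path
    p     : dict mapping edge label -> probability that the edge operates
    """
    total = 0.0
    for k in range(1, len(paths) + 1):
        sign = 1.0 if k % 2 == 1 else -1.0
        for combo in combinations(paths, k):
            # All k paths operate iff every edge in their union operates.
            edges = set().union(*combo)
            total += sign * prod(p[e] for e in edges)
    return total


# Hypothetical bridge network: edges 1=(s,a), 2=(s,b), 3=(a,b), 4=(a,t), 5=(b,t).
bridge_paths = [{1, 4}, {1, 3, 5}, {2, 3, 4}, {2, 5}]
p = {e: 0.5 for e in range(1, 6)}
r_bridge = reliability_inclusion_exclusion(bridge_paths, p)
print(r_bridge)  # 0.5 for this symmetric example
```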

Provan and Ball (1984) have shown how to calculate
U_st(G) using a certain partial order imposed on the s-t cutsets.
Their algorithm is pseudopolynomial: namely, it has a worst-
case time complexity O(mr^2) which is polynomial in r, the
number of s-t cutsets of G. More generally, Shier (1988) has
shown how the Provan and Ball method can be generalized
and applied (for instance) to the calculation of R_st(G) using the
paths of s-t planar graphs; also see Whited (1986).

Let E = {e_1, ..., e_m} be the set of edges and let S = {S_1,
..., S_r} be a collection of subsets of E. (For example, the
subsets might be s-t paths or s-t cutsets.) Each edge has two
states, active and inactive. A set S_i ⊆ E is called active if all its
components are active. We suppose that the collection S
forms a partial order having the semilattice property:
namely, any two S_i, S_j ∈ S have a unique greatest lower
bound S_i ∧ S_j. Two additional requirements are imposed
here.

(1) Closure: If S_i and S_j are active then S_i ∧ S_j is active.

(2) Convexity: If e ∈ S_i and e ∈ S_j then e ∈ S_k for any
S_k with S_i ≽ S_k ≽ S_j.

For example, suppose we define the partial ordering ≽ on the
s-t paths of any (s,t)-planar network as done in Section 4.
Namely, S_i ≽ S_j if path S_i is geometrically “above” path S_j.

In this case, the greatest lower bound of S_i and S_j is just
L(S_i,S_j), as defined earlier.

As an example of this ordering, consider the undirected
graph G in Figure 5.1. The seven s-t paths and associated
partial order are depicted in Figure 5.2, where each set S_i is
represented by a node and the link from S_i down to S_j in the
diagram represents the relation S_i ≽ S_j. In this representation,
any relations that can be inferred from the transitivity of ≽ are
not explicitly represented by links.


Figure  5.2  Partial  ordering  of  s-t  paths 

On the other hand, we can consider the s-t cutsets of this
graph. Now each such cutset S_i can be represented as
(X_i, X̄_i), with s ∈ X_i and t ∈ X̄_i. A natural partial ordering is
then S_i ≽ S_j whenever X_i ⊇ X_j. The cutsets and associated
partial order are shown in Figure 5.3.


Figure  5.3  Partial  ordering  of  s-t  cutsets 

We shall denote by A_j the event {S_j is active}. Then our
reliability calculations reduce to evaluating Ω(S) = Pr(A_1 ∪ A_2
∪ ... ∪ A_r). If we interpret “active” as meaning “functioning”
and if A_j is the event that path S_j is functioning, then Ω(S) is
simply R_st(G), as seen from (5.1). If “active” means “failed”
and A_j is the event that all edges in cutset S_j have failed, then
Ω(S) is U_st(G), from (5.2). It is to a general algorithm for
calculating Ω(S) that we now turn.

Because the events A_j are not disjoint, we will instead
define events L_j that are disjoint by using L_j = {S_j is the
“lowest” active set in S}. For example, if only edge 4 fails in
Figure 5.1, then S_1, S_2, S_3 and S_5 are the active sets in Figure
5.2 (none contain edge 4) and the event L_j occurs for the lowest
of these sets. If the sets
satisfy the closure and convexity properties stated earlier, then
the events L_j will indeed be a partition of the space A_1 ∪ A_2 ∪
... ∪ A_r. As a result, Ω(S) will equal the sum Σ Pr(L_j). As
shown by Shier (1988), a general recursive algorithm can be
obtained that expresses Pr(L_j) in terms of Pr(A_j) and earlier
determined Pr(L_i) values:

    \Pr(L_j) = \Pr(A_j) - \sum_{S_i \prec S_j} \Pr(L_i)\, a_{ij},    (5.3)

where a_ij = ∏ Pr[e is active : e ∈ S_j - S_i].

There are r equations represented in (5.3), each of which
involves at most O(r) terms. Furthermore, each term requires
at most O(m) operations to be carried out. Thus, the worst-
case complexity of this method of calculating Ω(S) is O(mr^2).
In other words, reliability can be calculated for planar graphs
in pseudopolynomial time when the s-t paths and s-t cutsets
can be suitably ordered. Also, if every edge is assigned a
common reliability value p, then the reliability (or unreliability)
can be expressed as a polynomial in p (or in q = 1 - p) using
this same method. The pseudopolynomial algorithm embodied
in equation (5.3) can now be combined with the previous
algorithms for efficiently generating the s-t cutsets in planar
graphs or the s-t paths in s-t planar graphs.
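
The recursion (5.3) can be sketched in a few lines. The code below is an illustration under assumptions, not the authors' implementation: the function name omega, the region labels, and the bridge network (edges 1=(s,a), 2=(s,b), 3=(a,b), 4=(a,t), 5=(b,t)) are hypothetical. The partial order on paths is encoded through region sets as in Section 4, so S_i lies strictly below S_j exactly when the region set of S_j is a proper subset of that of S_i.

```python
from math import prod


def omega(sets, regions, p):
    """Evaluate Pr(A_1 ∪ ... ∪ A_r) with recursion (5.3).

    sets[j]    : frozenset of edges in S_j
    regions[j] : frozenset of regions R_j (S_i below S_j iff R_j < R_i)
    p[e]       : probability that edge e is active
    Returns (total, pr_L) where pr_L[j] = Pr(L_j).
    """
    # Process the lowest sets first, so every Pr(L_i) with S_i below S_j
    # is already known when S_j is handled.
    order = sorted(range(len(sets)), key=lambda j: -len(regions[j]))
    pr_L = {}
    for j in order:
        value = prod(p[e] for e in sets[j])            # Pr(A_j)
        for i in pr_L:
            if regions[j] < regions[i]:                # S_i strictly below S_j
                a_ij = prod(p[e] for e in sets[j] - sets[i])
                value -= pr_L[i] * a_ij
        pr_L[j] = value
    return sum(pr_L.values()), pr_L


# Hypothetical bridge network; "alpha" is the region bounded by (s,t),
# "x" and "y" are the two interior triangles.
sets = [frozenset({1, 4}), frozenset({1, 3, 5}),
        frozenset({2, 3, 4}), frozenset({2, 5})]
regions = [frozenset({"alpha"}), frozenset({"alpha", "y"}),
           frozenset({"alpha", "x"}), frozenset({"alpha", "x", "y"})]
p = {e: 0.5 for e in range(1, 6)}
total, pr_L = omega(sets, regions, p)
print(total)  # 0.5, matching the known bridge reliability at p = 0.5
```

Each of the r sets is compared with at most r earlier ones, and each comparison touches at most m edges, which mirrors the O(mr^2) bound stated above.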

This combined approach has enabled the calculation of
reliability for some very challenging networks in the literature.
We summarize the computational results for a number of (p,q)
grid graphs in Table 5.1. This table lists the size of the grid
networks (n vertices and m edges), the number of s-t cutsets
(c_st), and the number of s-t paths (p_st). In addition, the total
computation times on an IBM 3081K mainframe are included
for calculating reliability using the s-t cutsets and also using
the s-t paths. It should be emphasized that rather than simply a
single numerical answer, we obtain a functional form for the
reliability polynomial expressed in terms of the common edge
reliability p (using the s-t paths) and for the unreliability
polynomial as a function of the common edge failure
probability q (using the s-t cutsets).

It is seen that there can be either more s-t cutsets or more s-t
paths in such graphs, depending on the grid graph parameters.
Calculation is clearly preferred using the smaller number of
generated objects, and this justifies our emphasis on efficient
generation of both paths and cutsets in planar graphs. Note
that the (3,5) and (6,2) grid graphs are in fact duals of one
another; this is manifested in the number of cutsets of one
equalling the number of paths of the other. Also, the (4,3) grid
is self-dual: it has the same number of s-t paths as cutsets.
Comparison of the associated CPU times for such (dual) grids
reveals that path generation is somewhat faster than cutset
generation, other things being equal.


Table 5.1 Pseudopolynomial calculation of reliability for grid
graphs

  p    q     n     m     c_st     p_st        T1        T2
  2    6    14    20       49      128      .124      .400
  3    3    11    18       80       95      .208      .245
  3    4    14    23      195      313     1.165     2.120
  3    5    17    28      444    1,033     5.834    20.724
  3    6    20    33      969    3,411    28.375   197.402
  4    2    10    18       95       80      .264      .196
  4    3    14    25      426      426     4.464     4.103
  4    4    18    32    1,745    2,320    68.975   100.724
  5    2    12    23      313      195     2.358     1.116
  5    3    17    32    2,320    1,745   110.134    64.005
  6    2    14    28    1,033      444    22.460     5.623

T1 is execution time in seconds, using s-t cutsets; T2 is
execution time in seconds, using s-t paths. Times do not
include set-up time of approximately .02 seconds per problem.

ABSTRACT

Let α_{n,k,d} be the fraction of vertices of degree k
in a minimal spanning tree on a random sample
of n vertices in d dimensions. Steele et al. (1987)
show that as n increases α_{n,k,d} converges with
probability one to an unknown constant α_{k,d} for
any sampling distribution having a density in ℝ^d.
They perform a small scale simulation experiment
to determine {α_{k,2}} by estimating {α_{n,k,2}} for
increasing values of n when vertices are distributed
uniformly in the unit square.
Here, we estimate {α_{k,2}} directly by systematically
sampling the neighborhood of a particular
vertex of the Poisson process with constant intensity
in 2 dimensions. The method easily generalizes
to higher dimensions. We discuss a variety of
algorithms used to improve the efficiency of the
sampling scheme.

1  INTRODUCTION  AND  SUMMARY 

Let G = (V, E) be a connected graph with vertex
set V = {v} and edge set E = {e}. Let w(e)
be a real number called the length of edge e. A
minimal spanning tree T of G is a connected subgraph
of G with vertex set V and edge set E' ⊆ E
such that

    \sum_{e \in E'} w(e)

is as small as possible. From a slightly different
viewpoint, a graph T is a tree if it is connected
and has no circuits. A graph T is a spanning tree
of a graph G if T and G have the same vertex set
and T is a tree. A graph T is a minimal spanning
tree (MST) of a graph G if it is the “shortest”
spanning tree of G.

The work of Steele et al. (1987) demonstrated
that α_{n,k,d}, the fraction of vertices of degree k in
a minimal spanning tree of the complete graph
on n random vertices in d dimensions, converged
with probability one to the fraction α_{k,d} independent
of the sampling distribution. We are
interested in “determining” the fractions α_{k,d} for
d = 2. In Steele et al. (1987) there is some speculation
concerning the possibility that the constant
α_{1,2} = 2/9. One of our motivations for this
work was to attempt to assess the validity of that
speculation.

Our approach differs from the straightforward
approach taken in Steele et al. (1987). They
generated random samples of size n from a uniform
distribution on the unit square and let n
range from 16 to 65536. For each value of n only
20 minimal spanning trees were built. The total
number of vertices they examined was about
2.6 × 10^6. Their approach suffers from two drawbacks.
One drawback is the finite sample size n.
The fraction α_{1,2} is an “asymptotic” constant.
The theory derives this constant as a limit when
n → ∞. The second drawback is the effect of the
edges of the square. It is reasonable to suppose
that leaves are more frequent near the “edges” of
the sample. For fixed d this effect may diminish
as n increases.

The details of the theoretical derivation in
Steele et al. (1987) depend on the closeness of
a homogeneous planar Poisson process to a sample
from a uniform distribution on the square.
In fact the constants α_{k,2} are actually properties
of the homogeneous Poisson process. Here we
will generate partial realizations (subsets) of the
homogeneous Poisson process and will determine
the vertex degree of only one vertex v_0. The subsets
only contain vertices in the vicinity of the
chosen vertex. One additional benefit of our approach
(which we have not taken advantage of)
is that properties other than MST vertex degrees
could be determined in the same way (for example,
the number of Voronoi neighbors). The
Voronoi polygon of each vertex of the Poisson
process is that subset of the plane consisting of
all points which are closer to the given vertex
than to any other vertex of the Poisson process.
Two vertices of the Poisson process are Voronoi
neighbors if their Voronoi polygons share a common
edge.

Our approach is to generate a local piece of
the Poisson realization by generating the vertices
of the process which are nearest to v_0. We determine
as much of the MST locally as we can,
beginning at v_0. The vertex degree of v_0 in this
partial MST is a lower bound on its vertex degree
in the full MST of the entire Poisson process.
We continue sampling and generating more of the
MST until all the Voronoi neighbors of v_0 have
been joined to the MST. At this point the vertex
degree of v_0 is exact. This naive approach generated
a new problem: the procedure often requires
generation of very, very large numbers of vertices
of the Poisson process. A first revision of this approach
was to “grow” the MST simultaneously
from many vertices. This provided considerable
improvement but was still unsatisfactory. Our
second modification was to determine an upper
bound on the vertex degree of v_0 by determining
the full MST of the subset of the Poisson process.

In section 4 we give our estimates of the α_{k,2}
together with some “conservative” 95% confidence
intervals. These estimates are based on
determining the vertex degree of approximately
1.6 X 10*^ vertices (but required the generation of
a much larger number of vertices of the Poisson
process).

2  SAMPLING  A  POISSON  PROCESS 

Let v_0 be any vertex in the homogeneous Poisson
process with intensity λ in d dimensions.
Let R_1, R_2, ... be the ordered distances from v_0
to the other vertices. The joint distribution of
R_1^d, R_2^d, ... is known to be exactly the same as the
distribution of the ordered distances from the origin
for a homogeneous Poisson process on the line
with intensity proportional to λ. See, for example,
Kendall and Moran (1963). Since the homogeneous
planar Poisson process is isotropic, it is
easy to see that the ordered distances from a homogeneous
linear Poisson process (transformed back to radii),
paired together with angles uniformly distributed on
[0, 2π), yield a planar process which is Poisson.

Since, in our application, we anticipate needing
only those vertices of the Poisson process which
are near neighbors of the chosen vertex v_0, we
generate the vertices of the process in an ordered
fashion. More precisely, we define an increasing
sequence of sample sizes n_0, n_1, ... and we
generate an increasing sequence of circular subsets
V_0, V_1, ... where V_i contains the n_i nearest
neighbors of v_0. Thus,

V_0 = {v_0}

and

V_0 ⊂ V_1 ⊂ ...

This  is  not  a  new  idea;  see,  for  example,  Quine 
and  Watson  (1984). 
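
The ordered generation can be sketched for the planar case as follows. This is a minimal illustration rather than the authors' code: the generator name and its parameters are assumptions. It relies on the standard fact that, for a planar process of intensity λ around a vertex placed at the origin, the quantities πλR_i^2 are the arrival times of a unit-rate Poisson process on the line, so successive radii come from cumulative sums of unit exponentials.

```python
import math
import random
from itertools import islice


def poisson_points_by_distance(intensity=1.0, rng=None):
    """Yield points of a homogeneous planar Poisson process in order of
    increasing distance from the origin (where v_0 is assumed to sit)."""
    rng = rng or random.Random()
    t = 0.0
    while True:
        t += rng.expovariate(1.0)                 # next arrival on the line
        r = math.sqrt(t / (math.pi * intensity))  # transform back to a radius
        theta = rng.uniform(0.0, 2.0 * math.pi)   # isotropy: uniform angle
        yield (r * math.cos(theta), r * math.sin(theta))


# Nested circular subsets: keep drawing from the same generator.
gen = poisson_points_by_distance(intensity=1.0, rng=random.Random(1))
v1 = list(islice(gen, 64))          # the 64 nearest neighbors of v_0
v2 = v1 + list(islice(gen, 64))     # extended to the 128 nearest neighbors
```

Because the generator keeps its state, each larger subset extends the previous one without regenerating any vertex, which is the property the nested sequence V_0, V_1, ... requires.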

The advantage of this procedure in applications
such as ours is obvious; it is only necessary
to generate as much of the Poisson process as is
necessary to determine the property of interest.
Of course, it has the disadvantage, compared to
the procedure in Steele et al. (1987), of requiring
the generation of a great many more vertices.

3  DETERMINING  VERTEX  DEGREE 

In this section, we develop an algorithm to find
the asymptotic degree of a vertex in a given realization
of the planar Poisson process. More formally,
if V_0, V_1, ... is a sequence of circular subsets
constructed as in the previous section, with
V_0 ⊂ V_1 ⊂ ..., and d_i is the degree of vertex
v_0 in the MST of V_i, we develop an algorithm
to determine lim_{i→∞} d_i. By running this algorithm
on a large number of realizations, we may
obtain α̂_{k,2} = f_k, the observed relative frequency
of vertices of degree k.

This procedure is very computationally intensive.
It generates a subset of the planar Poisson
process, computes a property (vertex degree)
of one vertex, then throws the subset away and
starts over. Why not compute the degree of more
than one vertex in a given subset? This would be
possible, in theory. In practice, however, we are
interested in frequency estimates whose uncertainty
is easily computed. If we examine one vertex
from each realization, the observed frequencies
have a multinomial distribution. Standard
errors and confidence intervals may be computed
using standard statistical theory. If we examine
more than one vertex, the distribution of the
observed frequencies is unknown, and we would
have to verify elementary assumptions (e.g. that
α_{k,2} is the expected value of f_k).

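As a sketch of the "standard statistical theory" mentioned above, each observed frequency f_k from N independent realizations has the binomial standard error sqrt(f_k(1 - f_k)/N). The helper below is an assumption for illustration: it uses the ordinary normal approximation, which is not necessarily how the paper's "conservative" intervals were constructed, and the counts passed to it are placeholders, not results from the paper.

```python
import math


def degree_frequencies(counts, z=1.96):
    """Relative frequencies f_k and normal-approximation half-widths.

    counts[k] : number of realizations in which v_0 had degree k
    z         : normal quantile (1.96 for a nominal 95% interval)
    Returns {k: (f_k, half_width)}.
    """
    total = sum(counts.values())
    result = {}
    for k, c in counts.items():
        f = c / total
        se = math.sqrt(f * (1.0 - f) / total)   # binomial standard error
        result[k] = (f, z * se)
    return result


# Placeholder counts chosen only to exercise the function.
example = degree_frequencies({1: 20, 2: 60, 3: 20})
for k, (f, hw) in sorted(example.items()):
    print(f"degree {k}: {f:.3f} +/- {hw:.3f}")
```
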
3.1 A Naive Algorithm

For fixed subsample size n, Prim (1957) shows
that the MST can be built by repeated application
of the following two principles:

P1: Any isolated vertex may be connected to a
nearest neighbor.

P2: Any tree can be connected to a nearest
neighbor by the shortest available edge.

In particular, the MST may be built by applying
P1 to the vertex v_0, followed by n-2 applications
of P2.
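
For a fixed set of points, growing the tree from v_0 by these two principles can be sketched as below. This is an illustrative O(n^2) version of Prim's algorithm on the complete Euclidean graph; the function names and the four sample points are assumptions, and the sketch covers only the fixed-sample case, not the resampling rule introduced next.

```python
import math


def prim_mst(points):
    """Return the MST edges (i, j) of the complete Euclidean graph on
    `points`, grown from vertex 0 (playing the role of v_0)."""
    n = len(points)
    in_tree = [False] * n
    best = [math.inf] * n     # shortest known distance from the tree
    link = [-1] * n           # tree vertex achieving that distance
    in_tree[0] = True
    for j in range(1, n):
        best[j] = math.dist(points[0], points[j])
        link[j] = 0
    edges = []
    for _ in range(n - 1):
        # P1 on the first pass, P2 afterwards: join the tree to its
        # nearest outside vertex by the shortest available edge.
        j = min((k for k in range(n) if not in_tree[k]), key=lambda k: best[k])
        edges.append((link[j], j))
        in_tree[j] = True
        for k in range(n):
            if not in_tree[k]:
                d = math.dist(points[j], points[k])
                if d < best[k]:
                    best[k] = d
                    link[k] = j
    return edges


def degree(edges, v=0):
    """Number of MST edges incident to vertex v."""
    return sum(v in e for e in edges)


points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (3.0, 0.0)]
edges = prim_mst(points)
print(edges, degree(edges))
```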

To these two principles, we add a third principle
and a stopping rule:

P3: If the edge to be added by P1 or P2 is
longer than the shortest distance from the
vertex (P1) or tree (P2) to the edge of the
sampling region, sample more vertices and
try again.

S1: Stop when v_0 and all its Voronoi neighbors
appear in the same MST.

By applying P1 to the vertex v_0, and then
applying P2 and P3 as often as necessary, the
algorithm stops with the degree of v_0 equal to
lim_{i→∞} d_i. To see this, consider the operation
of the algorithm on some subset V_k of vertices.
Principle P3 guarantees that the only
edges added to the tree are those that will be in
the MST of subsamples V_{k+1}, V_{k+2}, .... Once v_0
and its Voronoi neighbors are in the same tree,
adding another edge between v_0 and one of the
Voronoi neighbors will create a cycle. Since any
MST edge having v_0 as one endpoint must have
a Voronoi neighbor as the other endpoint, we see
that no further MST edges can have v_0 as an
endpoint.

The performance of this algorithm is poor. Table 1
shows the results of running the algorithm
on 5000 realizations of the Poisson process. In
over 30 percent of these (1651/5000) the algorithm
terminated after sampling 16384 vertices
without determining the vertex degree of v_0.

At our talk in Reston, we showed a videotape
illustrating the performance of this algorithm
and the other algorithms discussed below.
Figure 1 was adapted from the videotape. v_0 is
the large filled circle in the center of the figure.
The small filled circles represent other vertices
in the MST. The edges of the MST are represented
by line segments. The concentric large circles
show successive circular subsets. Each subset
contains twice as many vertices as the previous
one. The outer circle contains 2048 vertices. The
filled circle immediately to the left of v_0 is the one
Voronoi neighbor not included in the MST.

Reasons  for  the  poor  performance  of  this  algo¬ 
rithm  are  not  hard  to  find.  It  is  well  known  (see 
for  exanmle  Bentley  and  Friedman  1978)  th.a^  in 
the  construction  of  minima'  spanning  trees  by 


Table 1: Number of Vertices Required to Determine
Vertex Degree in 5000 Realizations

Number of                     Algorithm
Vertices              Naive    Rev. 1    Rev. 2

0-64                   1066      1857      4435
65-128                  425       682       247
129-256                 404       521       154
257-512                 383       484        84
513-1024                342       322        45
1025-2048               266       251        19
2049-4096               196       235        11
4097-8192               155       148         4
8193-16384              112       106         0
Failure (> 16384)      1651       394         1

Prim's algorithm, trees tend to grow "uphill" towards
areas of greater vertex density. If some of
the Voronoi neighbors of $v_0$ lie across a "valley"
in the vertex density, the MST may grow very
large before crossing the valley. For example, in
Figure 1, the unconnected Voronoi neighbor is
part of a small cluster of vertices separated from
the remaining vertices by a relatively large gap.
The MST might cross the gap eventually; we gave
up waiting after sampling 32768 vertices.

3.2  Revision  1 

Based on results similar to those in Table 1, we
concluded first that it was not clear that our algorithm
would ever terminate in many cases, and
second that even if it did terminate, it would require
too much computer time. Accordingly, we
set out to modify our algorithm.

By applying principle P1 more than once, it
is possible to use Prim's algorithm to grow more
than one tree at a time. We modified our algorithm
first to apply P1 to all the Voronoi neighbors
of $v_0$, and later so that it "grew" trees wherever
it could (up to a limit of about 50). We
felt that this might provide more opportunities
to cross "valleys" in the vertex density. For example,
consider the unconnected Voronoi neighbor
in Figure 1. If a tree is started here, it grows
to connect all the vertices in the isolated cluster.
The next edge added connects this tree with the
tree containing $v_0$. The final configuration, with
a single tree containing $v_0$ and all its Voronoi
neighbors, is shown in Figure 2. It contains only
512 vertices, a substantial improvement.

In general this extension produced considerable
improvement. As Table 1 shows, our revised
algorithm was unable to determine the vertex degree
of $v_0$ in less than ten percent of the samples
(394/5000), and in general needed to look at
fewer vertices.

Figure 1: Naive algorithm applied to a sample of 2048 vertices.

Figure 2: The algorithm, including our first revision,
applied to the sample of Figure 1.

Unfortunately, the improvement was not
enough. We conjecture that if we let this algorithm
sample up to 100,000 vertices, we would be
unable to determine the degree of $v_0$ roughly one
percent of the time. With this much uncertainty,
the simple confidence limits for $\alpha_{1,2}$ constructed
in Section 4 would always include 2/9. It might
have been possible to obtain narrower confidence
limits by treating the unresolved cases as censored
in some fashion. We felt that we did not
know enough about the censoring mechanism to
make this feasible.

3.3  Revision  2 

Two very important points provide us with a revised
and (at long last) useful algorithm. First,
as discussed above, the algorithms outlined so
far compute a lower bound on the degree of $v_0$.
Edges can be added to $v_0$ at any step, but can
never be deleted. Second, it is possible to compute
an upper bound on the degree of $v_0$. When
the upper and lower bounds are equal, no more
vertices need be sampled.

The upper bound may be obtained from the
following

Lemma 3.1 Consider the full minimal spanning
trees of two sets $S_1$ and $S_2$ of vertices, with $S_1 \subset S_2$.
If $e = \{v, v'\}$ is an edge of the complete graph
on $S_1$ with $e \notin \mathrm{MST}(S_1)$, then $e \notin \mathrm{MST}(S_2)$.


Proof  This  is  the  contrapositive  of  Lemma  2.1 
of  Steele  et  al.  (1987).  ■ 
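
Lemma 3.1 can be checked numerically on a small example. The sketch below is not the authors' code: it uses a plain quadratic-time version of Prim's algorithm, uniform random points in the unit square as a stand-in for the Poisson process, and illustrative names (`dist2`, `mst_edges`, `S1`, `S2`). It verifies the equivalent statement that every edge of MST($S_2$) with both endpoints in $S_1$ is also an edge of MST($S_1$).

```python
import random

def dist2(p, q):
    """Squared Euclidean distance between two points in the plane."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

def mst_edges(points):
    """Prim's algorithm on the complete Euclidean graph of `points`.

    Returns the MST as a set of frozensets, each holding the two
    endpoints of one edge.
    """
    pts = list(points)
    # best[q] = (squared distance from q to the current tree, nearest tree vertex)
    best = {q: (dist2(q, pts[0]), pts[0]) for q in pts[1:]}
    edges = set()
    while best:
        p = min(best, key=lambda q: best[q][0])   # closest vertex outside the tree
        _, parent = best.pop(p)
        edges.add(frozenset((p, parent)))
        for q in best:                            # update distances to the tree
            d = dist2(p, q)
            if d < best[q][0]:
                best[q] = (d, p)
    return edges

rng = random.Random(0)                            # fixed seed for repeatability
S2 = [(rng.random(), rng.random()) for _ in range(300)]
S1 = S2[:60]                                      # S1 is a subset of S2

mst1, mst2 = mst_edges(S1), mst_edges(S2)
s1 = set(S1)
# Edges of MST(S2) whose endpoints both lie in S1.
inherited = {e for e in mst2 if e <= s1}
print(len(inherited), "edges of MST(S2) lie inside S1; all in MST(S1):",
      inherited <= mst1)
```

Because edges inside $S_1$ can only be lost as vertices are added, the degree of a vertex counted over its neighbors in $S_1$ cannot increase, which is the property the revised algorithm exploits.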

Now consider some circular subsample $V_k$ of
vertices generated by the algorithm of the previous
section. The algorithm provides $d_k$, a lower
bound on the degree of $v_0$, and a set of trees. Suppose
that no more edges can be added to these
trees without sampling more vertices. Instead of
doing this, the revised algorithm remembers the
trees and then uses principles P1 and P2 to turn
the trees into the full MST of $V_k$. Provided that
all the Voronoi neighbors of $v_0$ are in $V_k$, Lemma
3.1 states that the degree of $v_0$ will never exceed
the degree attained in this full MST, which is hence
an upper bound on the degree of $v_0$ in $V_{k+1}$,
$V_{k+2}, \ldots$. If the degree of $v_0$ is $d_k$ in the full MST,
that is, if the degree did not change when the full
MST was built, then the lower and upper bounds
are equal, and the algorithm can stop. On the
other hand, if edges were added to $v_0$, the upper
bound is greater than the lower bound. In
this case, the algorithm must return to the trees
saved earlier, apply P3, and continue on.

Figures 3 and 4 illustrate this algorithm in operation.
The dotted lines represent edges added
in building the full MST; the circular subsamples
are respectively the first subsample (Figure 3)
and the first two subsamples (Figure 4) from Figure 1.
In Figure 3, the algorithm has started to
build the MST, and has added six edges. An
edge has been added to $v_0$. At this point, the
algorithm recognizes that the upper bound on
the degree of $v_0$ is at least two, while the lower
bound is one. Construction of the rest of the
full MST is pointless. In Figure 4, the algorithm
has sampled more vertices, expanded the existing
trees, and created some new ones. This did not
produce a tree containing $v_0$ and all its Voronoi
neighbors, so the full MST was built again. This
time, the degree of $v_0$ did not change. Thus, the
upper bound on the degree is now equal to the
lower bound, and no further sampling is needed.

Table 1 shows that this revised algorithm
needed 64 or fewer vertices to determine the degree
of $v_0$ in nearly ninety percent of the samples
(4435/5000). The one case requiring more than
16,384 vertices was resolved with 32,768 vertices.

4  RESULTS  AND  DISCUSSION 

In theory, it is only necessary to sample enough
vertices to make the nearest neighbor distance
less than the distance to the edge of the sampling
region. In practice, the nearest neighbor


Table 2: Observed Vertex Degree in 1,677,576
Simulation Runs

Degree     Resolved     Unresolved
k          $m_k$        $m_{k,k+1}$

1            371032             60
2            948732             53
3            345270              3
4             12424              0
5                 2              0

Table 3: Confidence Intervals for $\alpha_{k,2}$

Degree     95% Confidence Interval
k          Lower        Upper

1          0.220543     0.221836
2          0.564787     0.566355
3          0.205203     0.206460
4          0.007276     0.007537
5          *            *

algorithm that we use (Friedman et al. 1977)
has a high setup cost. To keep the overhead cost
down, we double the sample size each time we
need to sample more vertices.

Bentley and Friedman (1978) suggest that
the performance of their algorithm may degrade
when it is used to construct the MST of a large
set of vertices. We observed this behavior when
building the full MST, but not when building tree
fragments. This is consistent with their explanation.
We found that using an algorithm due to
Dwyer (1987) to build the full MST when the
number of vertices was large (more than 16,384)
made the simulations for large cases run two orders
of magnitude faster.

4.1  Confidence  Intervals 

A summary of the raw figures from our simulation
study is shown in Table 2. Cases that could
not be resolved by sampling 131072 vertices have
been tabulated according to the lower bound on
the degree of $v_0$. In all cases the upper bound
on the degree is one larger. Confidence intervals
for the $\alpha_{k,2}$ are shown in Table 3. Since we observed
only two vertices of degree five, we have
not shown a confidence interval for $\alpha_{5,2}$. The interval
for $\alpha_{1,2}$ does not lend support to the speculation
that $\alpha_{1,2} = 2/9$.

The confidence intervals were constructed as
follows: Let $m_k$ be the number of vertices whose
degree is known to be $k$, $m_{k,k+1}$ be the number
of vertices whose degree has not been determined
but is known to be $k$ or $k+1$, and $m$ be the total
number of simulation runs. (For our data, $m$ =
1,677,576, $m_1$ = 371,032, $m_{1,2}$ = 60, and so on.)
Since simulation runs are independent, we may
view the occurrence of a vertex of degree $k$ in
a given run as a Bernoulli trial with probability
$\alpha_{k,2}$. If we knew the exact vertex degree in every
simulation run, we would estimate $\alpha_{k,2}$ as

\[ \hat{\alpha}_{k,2} = f_k = \frac{m_k}{m}, \]

and then construct confidence intervals using the
usual normal approximation. Although we do
not know the exact vertex degree in every run,
we may still construct conservative confidence intervals.
If all vertices whose degree has not been
determined but is known to be $k-1$ or $k$ had
degree $k$, and all vertices whose degree has not
been determined but is known to be $k$ or $k+1$
also had degree $k$, we would estimate $\alpha_{k,2}$ as

\[ \hat{\alpha}_{k,2} = \frac{m_k + m_{k-1,k} + m_{k,k+1}}{m}. \]

If all vertices whose degree has not been determined
but is known to be $k-1$ or $k$ had degree
$k-1$, and all vertices whose degree has not been
determined but is known to be $k$ or $k+1$ had
degree $k+1$, our estimate of $\alpha_{k,2}$ would be

\[ \hat{\alpha}_{k,2} = \frac{m_k}{m}. \]

In either case, the usual normal approximation
may be used to construct confidence intervals.
By taking the union of the two intervals formed
in this way, we obtain a conservative confidence
interval for $\alpha_{k,2}$.
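
The construction above can be sketched in a few lines. This is an illustration rather than the authors' program: the counts are those of Table 2, the normal quantile 1.96 is an assumed choice for a two-sided 95% interval, and the function names are hypothetical. Under these assumptions the results agree with Table 3 to within a few units in the sixth decimal place.

```python
from math import sqrt

m = 1_677_576                                    # total simulation runs
resolved = {1: 371_032, 2: 948_732, 3: 345_270, 4: 12_424, 5: 2}   # m_k
unresolved = {1: 60, 2: 53, 3: 3, 4: 0, 5: 0}                      # m_{k,k+1}
Z = 1.96   # assumed normal quantile for a two-sided 95% interval

def normal_interval(count, n, z=Z):
    """Usual normal-approximation interval for a binomial proportion."""
    p = count / n
    half = z * sqrt(p * (1 - p) / n)
    return p - half, p + half

def conservative_interval(k):
    """Union of the intervals from the two extreme allocations of unresolved runs."""
    # Every unresolved run adjacent to k is given degree k ...
    high = resolved[k] + unresolved.get(k - 1, 0) + unresolved[k]
    # ... or none of them is.
    low = resolved[k]
    lo1, hi1 = normal_interval(low, m)
    lo2, hi2 = normal_interval(high, m)
    return min(lo1, lo2), max(hi1, hi2)

intervals = {k: conservative_interval(k) for k in (1, 2, 3, 4)}
for k, (lo, hi) in intervals.items():
    print(f"k={k}: ({lo:.6f}, {hi:.6f})")
```

Degree 5 is omitted, as in Table 3, because two observations are too few for the normal approximation.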

4.2  Point  Estimates 

Producing point estimates is more difficult. We
may write a log likelihood function for the observed
counts as

\[ \log L(\alpha_{1,2}, \alpha_{2,2}, \alpha_{3,2}, \alpha_{4,2}, \alpha_{5,2}) = \sum_{k=1}^{5} m_k \log \alpha_{k,2} + \sum_{k=1}^{4} m_{k,k+1} \log(\alpha_{k,2} + \alpha_{k+1,2}). \]


Table 4: Estimated Values of $\alpha_{k,2}$

k     $\hat{\alpha}_{k,2}$

1     0.221182
2     0.565586
3     0.205825
4     0.007406
5     0.000001
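
The entries of Table 4 can be reproduced from the counts of Table 2 by sharing each unresolved count $m_{k,k+1}$ between degrees $k$ and $k+1$ in proportion to the resolved counts $m_k$ and $m_{k+1}$, which is how this section describes the initial estimates. The sketch below is an illustration with hypothetical names (`m_k`, `m_pair`, `initial_estimate`), not the authors' code.

```python
m = 1_677_576                                    # total simulation runs
m_k = {1: 371_032, 2: 948_732, 3: 345_270, 4: 12_424, 5: 2}   # resolved counts
m_pair = {1: 60, 2: 53, 3: 3, 4: 0}              # m_{k,k+1} for k = 1, ..., 4

def initial_estimate(k):
    """Relative frequency of degree k after proportional allocation of unresolved runs."""
    share = 1.0
    if k - 1 in m_pair:      # runs known to have degree k-1 or k
        share += m_pair[k - 1] / (m_k[k - 1] + m_k[k])
    if k in m_pair:          # runs known to have degree k or k+1
        share += m_pair[k] / (m_k[k] + m_k[k + 1])
    return m_k[k] / m * share

estimates = {k: initial_estimate(k) for k in m_k}
for k, value in estimates.items():
    print(f"k={k}: {value:.6f}")
```

Because every unresolved run is split entirely between its two candidate degrees, the five estimates sum to one.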

Finding a maximum of this log likelihood analytically
appears to be difficult. We tried to maximize it using
a fairly sophisticated function maximization routine
(Gay 1983) and the initial estimates

\[
\begin{aligned}
\hat{\alpha}_{1,2} &= \frac{m_1}{m}\left(1 + \frac{m_{1,2}}{m_1 + m_2}\right), \\
\hat{\alpha}_{2,2} &= \frac{m_2}{m}\left(1 + \frac{m_{1,2}}{m_1 + m_2} + \frac{m_{2,3}}{m_2 + m_3}\right), \\
\hat{\alpha}_{3,2} &= \frac{m_3}{m}\left(1 + \frac{m_{2,3}}{m_2 + m_3} + \frac{m_{3,4}}{m_3 + m_4}\right), \\
\hat{\alpha}_{4,2} &= \frac{m_4}{m}\left(1 + \frac{m_{3,4}}{m_3 + m_4} + \frac{m_{4,5}}{m_4 + m_5}\right), \\
\hat{\alpha}_{5,2} &= \frac{m_5}{m}\left(1 + \frac{m_{4,5}}{m_4 + m_5}\right).
\end{aligned}
\]

These initial estimates allocate unresolved cases
based on the observed relative frequencies of resolved
cases. The maximization routine was unable
to do any better than the initial estimates.
The values of the initial estimates are shown in
Table 4.


4.3 Discussion

The estimates in Steele et al. (1987) agree with
ours to two decimal places for samples as small as
256 vertices, and to three decimal places for samples
of 4096 vertices. This suggests that edge effects
decrease rapidly as $n$ becomes large. It also
suggests that the constants $\alpha_{n,k,2}$ approach $\alpha_{k,2}$
at a reasonable rate. Thus, it ought to be possible
to estimate $\alpha_{k,d}$ in more than two dimensions
by generating vertices uniformly inside a $d$-cube,
building the minimal spanning tree, and tabulating
the observed relative frequencies of each
vertex degree. Examination of the behavior of
these frequencies should give some idea of how
large the samples should be. If we sound cautious
here, it is because intuition on this problem
has been wrong in the past. Our methodology
also generalizes, and hence our study could be
repeated in (say) three dimensions as a check on
intuition.


MATRIX COMPLETIONS, DETERMINANTAL MAXIMIZATION AND MAXIMUM ENTROPY


Charles R. Johnson*, The College of William and Mary
Wayne W. Barrett, Brigham Young University

1. Introduction

A partial matrix is one in which some
entries are (numerically) specified and the
remainder are unspecified, i.e., left as free
variables over some set (e.g., the field of
complex numbers). An example is

\[ \begin{bmatrix} 1 & -3 & ? \\ ? & 2 & 0 \\ 7 & 4 & 3 \end{bmatrix} \]

in which the "?"s indicate unspecified
entries. A completion of a partial matrix is
simply a specification of the unspecified
entries, resulting in a conventional matrix.
For an indicated class of matrices (such as
positive definite, or rank < k), the matrix
completion problem is to identify partial
matrices for which there is a completion in
the indicated class.

Among the completion problems that have
been considered are: positive definite
completions [B, DGo, GJSW]; inertia
possibilities [JR1]; contractions with
respect to the spectral norm [JR2];
minimum rank completions [W, JRW];
positive definite Toeplitz completions
(Johnson and Rodman have been studying
extensions of and a converse to the
classical Caratheodory/Fejer theorem);
completions of a Toeplitz contraction [JR4];
and completions which maximize the
minimum eigenvalue of a partial Hermitian
matrix [JR1]. Others which might be
considered include stability, controllability,
etc.

A feature common to many matrix
completion problems is that the class of
matrices of interest has an "inheritance
property." Namely, all principal
submatrices or all submatrices of the given
matrix are in the same class. For example,
all principal submatrices of a positive
definite matrix are positive definite and all
submatrices of a rank < k matrix have rank < k.
This imposes a necessary condition on
any partial matrix that it be completable to
a matrix in such a class; namely, any fully
specified submatrix of the necessary sort
must be in the desired class.

This raises a natural combinatorial
question: Given some class of matrices,
which "patterns" for the specified entries
ensure an affirmative answer to the matrix
completion problem, as long as specified
submatrices meet the obvious necessary
conditions? The next section will review
the solution to the completion problem for
positive definite matrices.


2. Positive Definite Case (General Theory)

We begin by defining terms and
introducing notation we need to describe
these results. A partial Hermitian matrix
$A = (a_{ij})$ is a square partial matrix whose
specified diagonal entries are real and such
that if $a_{ij}$ is specified, then so is $a_{ji}$, with
$a_{ji} = \bar{a}_{ij}$. A partial positive definite matrix
is a partial Hermitian matrix each of whose
specified principal submatrices is positive
definite. (By a specified portion of a partial
matrix we always mean one composed
entirely of specified entries.) Partial
positive semidefinite matrices are defined
similarly. We say that a partial positive
definite (positive semidefinite) matrix is
completable if it has a positive definite
(positive semidefinite) completion. If A is
an n-by-n partial (or full) Hermitian matrix
and $\alpha \subseteq \{1, 2, \ldots, n\}$ is an index set, $A[\alpha]$
denotes the principal submatrix of A
contained in the rows and columns indicated
by $\alpha$.

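We illustrate with the simple special
case

\[ A = \begin{bmatrix} a_1 & b_1 & x \\ \bar{b}_1 & a_2 & b_2 \\ \bar{x} & \bar{b}_2 & a_3 \end{bmatrix} \tag{2.1} \]

in which x is the one unspecified entry.
(Throughout we refer to an unspecified pair
$(x, \bar{x})$ as one unspecified entry.) The obvious
necessary conditions that A have a positive
definite completion are:

\[ a_2 > 0, \qquad \det\begin{bmatrix} a_1 & b_1 \\ \bar{b}_1 & a_2 \end{bmatrix} > 0, \qquad \det\begin{bmatrix} a_2 & b_2 \\ \bar{b}_2 & a_3 \end{bmatrix} > 0. \tag{2.2} \]

By the well known criterion that a
Hermitian matrix is positive definite if its
leading principal minors are positive [HJ, p.
404], we see that A is completable if x can
be chosen so that det A > 0. It is a
straightforward calculation (e.g. see
equation (5.1) in [BF]) that

\[ \det A = \frac{1}{a_2}\left( \det\begin{bmatrix} a_1 & b_1 \\ \bar{b}_1 & a_2 \end{bmatrix} \det\begin{bmatrix} a_2 & b_2 \\ \bar{b}_2 & a_3 \end{bmatrix} - \left| b_1 b_2 - a_2 x \right|^2 \right). \tag{2.3} \]

Since $a_2 > 0$, we may set $x = b_1 b_2 / a_2$, ensuring
that det A > 0. Therefore A is always
completable provided the necessary
conditions (2.2) are met. Furthermore, the
set of all x which give a positive definite
completion, namely, by (2.3), those x for which
$|b_1 b_2 - a_2 x|^2$ is less than the product of the
two determinants in (2.2), is a disc in the
complex plane whose center $x = b_1 b_2 / a_2$
gives the maximum possible
determinant for a completion of A. We call
the completion with $x = b_1 b_2 / a_2$ the
determinant-maximizing completion of A.

Notice that setting $x = b_1 b_2 / a_2$ is equivalent
to setting the cofactors

\[ \det\begin{bmatrix} b_1 & x \\ a_2 & b_2 \end{bmatrix} \quad\text{and}\quad \det\begin{bmatrix} \bar{b}_1 & a_2 \\ \bar{x} & \bar{b}_2 \end{bmatrix} \]

of A equal to 0, which is the same as
requiring that the 1,3 and 3,1 entries of $A^{-1}$
be 0. These are precisely the entries in $A^{-1}$
which correspond to unspecified entries in
A.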
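
A short numerical check of (2.3) and of the determinant-maximizing choice $x = b_1 b_2 / a_2$ may be helpful. The values of $a_1, a_2, a_3, b_1, b_2$ below are arbitrary illustrative numbers chosen to satisfy (2.2); they do not come from the paper, and the helper names are hypothetical.

```python
def det3(m):
    """Determinant of a 3-by-3 matrix given as nested lists."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)

# Illustrative specified entries satisfying the necessary conditions (2.2).
a1, a2, a3 = 3.0, 2.0, 4.0
b1, b2 = 1 + 1j, 2 - 1j

def completion(x):
    """The Hermitian completion of (2.1) with unspecified entry x."""
    return [[a1, b1, x],
            [b1.conjugate(), a2, b2],
            [x.conjugate(), b2.conjugate(), a3]]

d12 = a1 * a2 - abs(b1) ** 2       # det A[{1,2}]
d23 = a2 * a3 - abs(b2) ** 2       # det A[{2,3}]

def det_by_formula(x):
    """Right-hand side of (2.3)."""
    return (d12 * d23 - abs(b1 * b2 - a2 * x) ** 2) / a2

x_star = b1 * b2 / a2              # center of the disc of admissible x

for x in (0j, 1 + 0j, x_star, x_star + 0.3, x_star - 0.2j):
    direct = det3(completion(x)).real
    print(x, round(direct, 6), round(det_by_formula(x), 6))
```

The largest determinant in the printed list occurs at `x_star`, where the squared modulus in (2.3) vanishes and the determinant equals `d12 * d23 / a2`.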


Now suppose that A is an n-by-n partial
positive definite matrix with the 1,n entry
as its only unspecified entry. Applying
equation (5.1) of [BF] in exactly the same
manner we see that A is completable if the
necessary conditions are met. If the
principal submatrices $A[\{1, 2, \ldots, n-1\}]$ and
$A[\{2, 3, \ldots, n\}]$ are positive definite, the 1,n
entry may be chosen so as to make det A
(and all its principal minors) positive. The
set of all values giving a positive definite
completion is again a disc in the complex
plane and the center gives the maximum
determinant. Again the 1,n and n,1 entries
of the inverse being 0 characterizes the
determinant-maximizing completion. This
case seems first to have been noted in [B] by
different means and from the point of view
of maximum entropy methods. That work,
noting the connection between maximum
entropy and determinant maximization,
motivated interest in positive definite
completion.

In [DGo] a square partial matrix is called
banded if all the entries within some band
width (symmetric from the diagonal) are
specified and all entries outside are
unspecified. Recall that a full matrix
$A = (a_{ij})$ is called banded with band width k
if $a_{ij} = 0$ whenever $|i - j| > k$. The above
discussion of a single unspecified entry
(whose position could be arbitrary) is just
the special case of an n-by-n partial
positive definite matrix with band width n - 2.


Assuming that A is an n-by-n banded
partial matrix, three principal conclusions
are drawn in [DGo]. (The result (iii) is also
contained in [BF].)

(i) Positive definite completions of A
necessarily exist.

(ii) There exists a unique determinant-maximizing
positive definite completion.

(iii) There exists a unique completion
which is nonsingular and whose
inverse is banded (with the same
band width) in the usual sense; this
is also the determinant-maximizing
completion.

We  now  consider  the  general  question: 
for  which  partial  Hermitian  matrices  do 
positive  definite  completions  exist? 
Necessarily,  the  matrix  must  be  partial 
positive  definite,  but  is  this  condition 
sufficient  to  ensure  a  positive  definite 
completion?  If  not,  which  patterns  (in 
addition  to  banded)  for  the  specified  entries 
guarantee  a  positive  definite  completion? 

We  first  note  that  positive  definite 
completions  need  not  always  exist  even 
when  the  obvious  necessary  conditions  are 
met.  It  is  easy  to  see  ([GJSW])  that  given  a 
pattern  for  the  specified  entries,  the  matrix 
completion  problems  for  positive  definite 
and  positive  semidefinite  matrices  are 
equivalent.  For  simplicity  we  consider  the 




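positive semidefinite case, and consider the
partial Hermitian matrix

\[ B = \begin{bmatrix} 1 & 1 & 1 & y \\ 1 & 1 & x & -1 \\ 1 & \bar{x} & 1 & 1 \\ \bar{y} & -1 & 1 & 1 \end{bmatrix} \tag{2.4} \]

It is partial positive semidefinite as the
only specified principal submatrices are

\[ [1], \qquad \begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix}, \qquad\text{and}\qquad \begin{bmatrix} 1 & -1 \\ -1 & 1 \end{bmatrix}. \]

However x completes both partial principal
submatrices

\[ B[\{1,2,3\}] = \begin{bmatrix} 1 & 1 & 1 \\ 1 & 1 & x \\ 1 & \bar{x} & 1 \end{bmatrix} \qquad\text{and}\qquad B[\{2,3,4\}] = \begin{bmatrix} 1 & x & -1 \\ \bar{x} & 1 & 1 \\ -1 & 1 & 1 \end{bmatrix}. \]

Since $\det B[\{1,2,3\}] = -|x - 1|^2$ and
$\det B[\{2,3,4\}] = -|x + 1|^2$, the first
submatrix requires that x = 1 for a positive
semidefinite completion while the second
requires that x = -1. As these are in
conflict, B is not completable to a positive
semidefinite matrix.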
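
The two determinant identities behind this counterexample are easy to confirm numerically. The sketch below evaluates both 3-by-3 determinants at a handful of arbitrary sample values of x (the sample values and helper names are illustrative choices) and shows that they are never simultaneously nonnegative.

```python
def det3(m):
    """Determinant of a 3-by-3 matrix given as nested lists."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)

def b123(x):
    """B[{1,2,3}] from (2.4), with x the unspecified 2,3 entry."""
    return [[1, 1, 1], [1, 1, x], [1, x.conjugate(), 1]]

def b234(x):
    """B[{2,3,4}] from (2.4)."""
    return [[1, x, -1], [x.conjugate(), 1, 1], [-1, 1, 1]]

samples = [1 + 0j, -1 + 0j, 0j, 0.5 + 0.5j, -0.3 + 2j]
for x in samples:
    d1 = det3(b123(x)).real        # equals -|x - 1|^2
    d2 = det3(b234(x)).real        # equals -|x + 1|^2
    print(x, round(d1, 6), round(d2, 6))
```

At x = 1 the first determinant is 0 and the second is -4; at x = -1 the roles are reversed, so no single x makes both submatrices positive semidefinite.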

Of  course,  some  partial  positive  definite 
matrices  whose  specified  entries  have  the 
same  pattern  as  B  may  have  positive 
definite  completions.  Take  any  4-by-4 
positive  definite  matrix  C  and  replace  the 
secondary  diagonal  by  (Hermitianly) 
unspecified  entries.  Then  this  partial 
matrix  has  a  positive  definite  completion, 
namely  C. 


The interesting question then is: for
which patterns does a partial positive
definite matrix always have a positive
definite completion? This question is
addressed in [GJSW] and a characterization
of completable patterns is given. A natural
way of describing patterns is in terms of
the undirected graph G = G(A) of the
specified entries.

Given an n-by-n partial Hermitian matrix
A, G(A) has vertex set {1, 2, ..., n} and an
edge between i and j, i ≠ j, if and only if the
i,j entry of A is specified. Thus, the partial
matrix B above has the graph

[Figure: the four-cycle on vertices 1, 2, 3, 4 with edges {1,2}, {2,4}, {3,4}, {1,3}]   (2.5)

Undirected graphs are appropriate because
we assume the partial matrix Hermitian.
Without loss of generality, from now on we
assume that all diagonal entries of A are
specified (because a partial positive
definite matrix is completable if and only if
the principal submatrix corresponding to the
specified diagonal entries is completable.)

We briefly review some basic ideas about
undirected graphs. A path $(i_1, i_2, \ldots, i_k)$ is a
sequence of vertices such that $\{i_j, i_{j+1}\}$ is
an edge of G for $j = 1, \ldots, k-1$. G is
connected if there is a path between any
two vertices in V. A circuit is a path for
which $i_k = i_1$ and k > 3. A simple circuit is a
circuit for which $i_1, i_2, \ldots, i_{k-1}$ are distinct.
A chord of a circuit is an edge joining two
nonconsecutive vertices in the circuit. A
circuit is minimal if it has no chord. For
example, in the graph $G_1$,

[Figure: graph $G_1$ on vertices 1 through 5]

(1, 2, 5, 4) is a path, (1, 2, 3, 4, 5, 1) is a
simple circuit, {2, 5} is a chord of this
circuit, and (2, 3, 4, 5, 2) is a minimal
simple circuit.

The key notion which allows a simple
description of completable patterns is that
of a chordal graph: we call G chordal if it
has no minimal simple circuits of four or
more edges. Thus, the graph $G_1$ above is not
chordal because of the minimal simple
circuit (2, 3, 4, 5, 2). However, addition of
the single edge {2, 4} would make it a
chordal graph. A good general reference for
chordal (also called triangulated) graphs is
Chapter 4 of [G]. They have been heavily
studied in graph theory and have arisen
before in numerical linear algebra, in the
study of Gaussian elimination on sparse
matrices. It is worth noting that virtually
all computational tasks on chordal graphs
can be carried out cheaply.
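
As one illustration of how cheaply chordality can be tested, the sketch below repeatedly deletes a simplicial vertex (one whose remaining neighbors are pairwise adjacent); a graph is chordal exactly when this process removes every vertex. This is a standard elimination test, not a procedure taken from the paper. The edge list for $G_1$ is an assumption inferred from the path, circuit and maximal cliques that the text attributes to $G_1$, since the figure itself is not available.

```python
def is_chordal(vertices, edges):
    """Return True if the undirected graph has a perfect elimination ordering."""
    adj = {v: set() for v in vertices}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    remaining = set(vertices)
    while remaining:
        for v in sorted(remaining):
            nbrs = adj[v] & remaining
            # v is simplicial if its remaining neighbors form a clique.
            if all(b in adj[a] for a in nbrs for b in nbrs if a != b):
                remaining.remove(v)
                break
        else:
            return False          # no simplicial vertex left: not chordal
    return True

# Assumed edge list for G_1 (inferred from the text, figure not available).
g1_edges = [(1, 2), (1, 5), (2, 5), (2, 3), (3, 4), (4, 5)]
# Graph (2.5) of the partial matrix B: a four-cycle.
b_edges = [(1, 2), (2, 4), (3, 4), (1, 3)]

print(is_chordal(range(1, 6), g1_edges))              # G_1
print(is_chordal(range(1, 6), g1_edges + [(2, 4)]))   # G_1 with the edge {2, 4} added
print(is_chordal(range(1, 5), b_edges))               # graph of B
```

The simple loop above takes polynomial time; linear-time methods exist but are not needed for graphs of this size.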

A  principal  result  of  [GJSW]  is 
Theorem  1:  Every  partial  positive  definite 
matrix  with  graph  G  has  a  positive  definite 
completion  if  and  only  if  G  is  chordal. 

A  summary  of  the  proof  can  be  found  in 
[J].  We  simply  note  here  that  the  graph  of 
any  banded  partial  matrix  is  chordal  so  that 
this  theorem  gives  a  complete 
generalization  of  conclusion  (i)  above  from 
[DGo].  Note  also  that  the  graph  (2.5)  of  the 
matrix  B  defined  by  equation  (2.4)  is  the 
simplest  non-chordal  graph.  According  to 
Theorem  1,  not  every  partial  positive 
definite  matrix  with  this  graph  is 
completable,  and  the  matrix  B  is  an  example 
of  one  that  is  not.  It  is  typical  of  a  general 
class  of  counterexamples  that  exhibits 
chordality  as  a  necessary  condition  for 
completability. 

Provided that it is known there exists a
positive definite completion to a partial
positive definite matrix, conclusions (ii)
and (iii) of [DGo] carry over irrespective of
the pattern of the unspecified entries. This
is another principal result in [GJSW].

Theorem 2. Suppose that the partial
positive definite matrix A has a positive
definite completion. Then there is a unique
determinant-maximizing completion M
which is also the unique completion whose
inverse has zeros in the positions
corresponding to unspecified entries.

3. Determinantal Maximization

For the reasons suggested in [B] and as in
[DGo] we call the determinant-maximizing
completion M of a completable partial
positive definite matrix A the maximum
entropy completion of A. An intriguing
question is: what is the value of det M as a
function of the specified entries of A? The
simplest example is the case in which only
the diagonal entries of A are specified. If B
is the completion obtained by setting all
unspecified off-diagonal entries in A equal
to 0, then $\det B = \prod_{i=1}^{n} a_{ii}$ and so by
Hadamard's inequality, B is the
determinant-maximizing completion of A.
However, this simple example masks the
fact that the key idea is that $B^{-1}$ has all
off-diagonal entries equal to 0.

Three  distinct  ways  to  obtain  the 
determinant  of  the  maximum  entropy 
completion  M  of  a  partial  positive  definite 
matrix  A  have  been  found  [BJL]  whenever  the 
graph  G(A)  is  chordal.  We  summarize  these 
here  and  note  that  the  last  also  allows  one 
to  obtain  the  maximum  entropy  completion 
itself.  In  order  to  describe  these  results  we 
make  another  digression  into  graph  theory. 

Let G be an undirected graph with vertex
set V = {1, 2, ..., n}. A nonempty subset
$C \subseteq V$ is called a clique of G if {x, y} is an
edge of G for all distinct $x, y \in C$. The
clique C is called a maximal clique if C is
not a proper subset of any clique. In the
graph $G_1$ above, {1, 2, 5}, {2, 3}, {3, 4} and
{4, 5} are the maximal cliques. Now let
$\mathcal{C} = \{C_1, C_2, \ldots, C_m\}$ be the set of maximal
cliques of G. The intersection graph $G_{\mathcal{C}}$ of $\mathcal{C}$
is the graph with vertex set $\mathcal{C}$ and edge
set $\mathcal{E}$ where $\{C_i, C_j\} \in \mathcal{E}$ if and only if $i \neq j$
and $C_i \cap C_j \neq \emptyset$. A subgraph T of $G_{\mathcal{C}}$ is called
a spanning tree of $G_{\mathcal{C}}$ if T is a tree (a
connected graph with no circuits) with
vertex set $\mathcal{C}$. Such a tree T is said to
satisfy the intersection property if

$C_i \cap C_j \subseteq C_k$ whenever $C_k$ lies on
the (unique) path from $C_i$ to $C_j$ in T.   (IP)

For example, let G be the graph

[Figure: graph G on vertices 1 through 6]

Then the maximal cliques are $C_1$ = {1, 2, 3},
$C_2$ = {2, 3, 5}, $C_3$ = {2, 4, 5} and $C_4$ = {3, 5, 6}, and
the intersection graph $G_{\mathcal{C}}$ is

[Figure: intersection graph $G_{\mathcal{C}}$]

Then the graph

[Figure: a spanning tree of $G_{\mathcal{C}}$]

is a spanning tree of $G_{\mathcal{C}}$ satisfying the
intersection property (IP), while the
spanning tree $C_3 - C_1 - C_2 - C_4$, for example,
does not satisfy (IP). The intersection
property is a key hypothesis in several
papers on determinantal identities and
inequalities [BJ1, JB, BJ2]. Its significance
in the present context is the following
fundamental graph theoretic result ([BJL]).
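
The intersection property can be tested mechanically for the four maximal cliques listed above. In the sketch below the tree called `star` (every clique joined to $C_2$) is an assumed example of a tree satisfying (IP), since the spanning tree drawn in the original figure is not available; `chain` is the tree $C_3 - C_1 - C_2 - C_4$ named in the text. Function names are illustrative.

```python
from itertools import combinations

# Maximal cliques C_1, ..., C_4 of the example graph, keyed by index.
cliques = {1: {1, 2, 3}, 2: {2, 3, 5}, 3: {2, 4, 5}, 4: {3, 5, 6}}

def tree_path(tree_edges, start, goal):
    """Vertices on the unique path from start to goal in a tree."""
    adj = {}
    for u, v in tree_edges:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    stack = [(start, [start])]
    while stack:
        node, path = stack.pop()
        if node == goal:
            return path
        for nxt in adj.get(node, []):
            if nxt not in path:
                stack.append((nxt, path + [nxt]))
    return None

def satisfies_ip(tree_edges):
    """Check that C_i & C_j is contained in every C_k on the path from C_i to C_j."""
    for i, j in combinations(cliques, 2):
        common = cliques[i] & cliques[j]
        if any(not common <= cliques[k] for k in tree_path(tree_edges, i, j)):
            return False
    return True

star = [(1, 2), (3, 2), (4, 2)]     # assumed tree: every clique joined to C_2
chain = [(3, 1), (1, 2), (2, 4)]    # the tree C_3 - C_1 - C_2 - C_4 from the text

print(satisfies_ip(star), satisfies_ip(chain))
```

The chain fails because $C_3 \cap C_2$ = {2, 5} is not contained in $C_1$, which lies between them on the path.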

Theorem 3. Let G be a connected
undirected graph; let $\mathcal{C} = \{C_1, C_2, \ldots, C_m\}$ be
the set of maximal cliques of G, and let $G_{\mathcal{C}}$
be the corresponding intersection graph.
Then, there is a spanning tree of $G_{\mathcal{C}}$
satisfying (IP) if and only if G is chordal.

A  consequence  [BJL]  of  Theorems  1-3  and 
the  theorem  in  section  2  of  [BJ2]  (a  formula 
for  the  determinant  of  a  matrix  based  on  the 
zero  pattern  of  its  inverse)  is: 

Theorem 4. Let A be a partial positive
definite matrix and assume that G(A), the
undirected graph of the specified entries of
A, is connected and chordal. Then if B is any
positive definite completion of A,

\[ \det B \le \frac{\prod_{k=1}^{m} \det A[C_k]}{\prod_{\{C_i, C_j\} \in \mathcal{E}(T)} \det A[C_i \cap C_j]} \tag{3.1} \]

where T is any spanning tree of $G_{\mathcal{C}}$
satisfying (IP) and $\mathcal{E}(T)$ is the edge set of T.
Furthermore, equality is attained in (3.1) if
and only if B is the maximum entropy
completion of A.

Thus,  the  right  hand  side  of  (3.1)  is  a 
formula  for  the  determinant  of  the 
maximum  entropy  completion  of  A  in  terms 
of  its  specified  entries. 

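As a simple example, suppose that

\[ A = \begin{bmatrix} 1 & 1 & ? & ? \\ 1 & 3 & 4 & ? \\ ? & 4 & 6 & 2 \\ ? & ? & 2 & 1 \end{bmatrix} \]

and that M is the maximum entropy
completion of A. Then

\[ \det M = \frac{\det\begin{bmatrix} 1 & 1 \\ 1 & 3 \end{bmatrix} \det\begin{bmatrix} 3 & 4 \\ 4 & 6 \end{bmatrix} \det\begin{bmatrix} 6 & 2 \\ 2 & 1 \end{bmatrix}}{3 \cdot 6} = \frac{2 \cdot 2 \cdot 2}{18} = \frac{4}{9}. \]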
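
The value 4/9 can be confirmed with exact rational arithmetic. The sketch below is an illustration under stated assumptions: it treats the real symmetric case only, fills the unspecified entries one at a time in the order (1,3), (2,4), (1,4), and chooses each entry to maximize the determinant of the principal submatrix that the new entry completes. Since that determinant is a quadratic in the entry, three evaluations determine its maximizer. The helper names (`det`, `sub`, `fill`) are hypothetical.

```python
from fractions import Fraction as F

def det(m):
    """Determinant by Laplace expansion along the first row (small matrices only)."""
    if len(m) == 1:
        return m[0][0]
    total = F(0)
    for j, a in enumerate(m[0]):
        minor = [row[:j] + row[j + 1:] for row in m[1:]]
        total += (-1) ** j * a * det(minor)
    return total

def sub(m, idx):
    """Principal submatrix of m on the (0-based) index list idx."""
    return [[m[i][j] for j in idx] for i in idx]

# Specified entries of the example; None marks "?".
A = [[F(1), F(1), None, None],
     [F(1), F(3), F(4), None],
     [None, F(4), F(6), F(2)],
     [None, None, F(2), F(1)]]

# Right-hand side of (3.1): cliques {1,2}, {2,3}, {3,4}; separators {2}, {3}.
rhs = (det(sub(A, [0, 1])) * det(sub(A, [1, 2])) * det(sub(A, [2, 3]))
       / (A[1][1] * A[2][2]))

def fill(m, i, j, clique):
    """Set m[i][j] = m[j][i] to the value maximizing det of the submatrix on clique."""
    def f(x):
        m[i][j] = m[j][i] = F(x)
        return det(sub(m, clique))
    f0, fp, fm = f(0), f(1), f(-1)
    c2 = (fp + fm) / 2 - f0        # coefficient of x^2 (negative here)
    c1 = (fp - fm) / 2             # coefficient of x
    m[i][j] = m[j][i] = -c1 / (2 * c2)

M = [row[:] for row in A]
fill(M, 0, 2, [0, 1, 2])           # entry (1,3)
fill(M, 1, 3, [1, 2, 3])           # entry (2,4)
fill(M, 0, 3, [0, 1, 2, 3])        # entry (1,4)

print(rhs, det(M), M[0][2], M[1][3], M[0][3])
```

Both `rhs` and `det(M)` evaluate to 4/9, and the filled entries are 4/3, 4/3 and 4/9.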


The sets $C_i \cap C_j$ corresponding to the
edges $\{C_i, C_j\} \in \mathcal{E}(T)$ can be described graph
theoretically, and independently of T, as the
minimal vertex separators of G(A) [BJL].

We  have  taken  G(A)  to  be  connected  in 
theorem  4  for  convenience  since  the 
disconnected  case  is  easily  dealt  with  using 
Fischer's  inequality  [HJ,  p.  478]. 

There is an alternative to the right hand
side of (3.1), an "inclusion-exclusion"
representation of the maximum determinant.
Suppose that A is partial positive definite,
G(A) is connected and chordal, $C_1, \ldots, C_m$ are
the maximal cliques of G, and M is the
maximum entropy completion of A. Then
([BJL])

\[ \det M = \frac{\prod_{i} \det A[C_i] \; \prod_{i<j<k} \det A[C_i \cap C_j \cap C_k] \cdots}{\prod_{i<j} \det A[C_i \cap C_j] \; \prod_{i<j<k<l} \det A[C_i \cap C_j \cap C_k \cap C_l] \cdots} \]

After significant cancellation the right hand
side may be seen to be the same as the right
hand side of (3.1). However, this
formulation does have the advantage of
requiring only a knowledge of the maximal
cliques of G(A).

A  third  way  to  obtain  the  maximum 
determinant  has  the  added  advantage  that 
the  entries  of  the  determinant  maximizing 
completion  are  directly  calculated  at  the 
same  time.  The  maximum  entropy 
completion  can  be  considered  to  be  the 
solution  to  a  multi-variable  maximization 
problem.  In  the  case  that  G(A)  is  chordal,  it 
is  shown  in  [GJSW]  (in  order  to  demonstrate 
the  sufficiency  in  theorem  1)  that  there 
exists a sequence of chordal graphs $G_i$,
$i = 0, \ldots, s$, such that $G_0 = G$, $G_s$ is the
complete graph and $G_i$ is obtained from $G_{i-1}$
by adding a single edge. (Such "chordal
orderings" of the edges missing from a
chordal graph are highly nonunique.)
Furthermore,  at  each  step,  there  is  a  unique 
maximal  clique  containing  the  added  edge. 
It  is  therefore  natural  to  consider  a 
sequence  of  one-step  maximization 
problems  in  which  one  selects  the  value  of 
the  entry  corresponding  to  the  new  edge  to 
be  the  one  that  maximizes  the  determinant 
of  the  principal  submatrix  whose  index  set 

is  the  new  maximal  clique.  Each  of  these 
one-step  maximization  problems  is  the 
case,  discussed  at  the  beginning  of  section 
2,  of  picking  the  maximum  determinant 
when  there  is  only  one  unspecified  entry. 
Remarkably,  the  matrix  obtained  at  the  end 
of this sequence of one-variable
maximization  problems  is  the  (unique) 
maximum  entropy  completion  of  A  [BJL, 
JR3],  no  matter  which  chordal  ordering  of 
the  missing  edges  was  chosen.  Thus 
(remarkably),  when  G(A)  is  a  chordal  graph, 
a  several-variable  optimization  problem  can 
be  solved  as  a  sequence  of  (very  simple) 
one-variable  optimization  problems,  making 
it  simple  to  obtain  the  determinant 
maximizing  completion. 
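
The sequence of one-variable problems can be sketched on the 4-by-4 example of section 3. This is an illustration under stated assumptions, not the authors' code: it uses the standard fact that, when a positive definite block has a single unspecified symmetric pair (p, q), the determinant is maximized by x = A[p,S] A[S,S]^{-1} A[S,q], where S is the rest of the clique (x = 0 if S is empty). The chordal ordering (1,3), (2,4), (1,4) and the helper names `solve`, `fill_entry` and `complete` are choices made here for the example.

```python
from fractions import Fraction

def det(m):
    """Determinant by cofactor expansion (adequate for small matrices)."""
    if not m:
        return 1
    return sum((-1) ** j * m[0][j] * det([row[:j] + row[j + 1:] for row in m[1:]])
               for j in range(len(m)))

def solve(m, rhs):
    """Solve m x = rhs exactly by Gauss-Jordan elimination (m nonsingular)."""
    n = len(m)
    aug = [[Fraction(v) for v in row] + [Fraction(b)] for row, b in zip(m, rhs)]
    for c in range(n):
        p = next(r for r in range(c, n) if aug[r][c] != 0)
        aug[c], aug[p] = aug[p], aug[c]
        piv = aug[c][c]
        aug[c] = [v / piv for v in aug[c]]
        for r in range(n):
            if r != c and aug[r][c] != 0:
                f = aug[r][c]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[c])]
    return [row[n] for row in aug]

def fill_entry(a, p, q, clique):
    """One-step maximization: choose a[p][q] maximizing det of a[clique]."""
    s = [k for k in clique if k not in (p, q)]
    x = Fraction(0)
    if s:
        w = solve([[a[i][j] for j in s] for i in s], [a[k][q] for k in s])
        x = sum((a[p][k] * wk for k, wk in zip(s, w)), Fraction(0))
    a[p][q] = a[q][p] = x
    return x

A = [[1, 1, None, None],
     [1, 3, 4, None],
     [None, 4, 6, 2],
     [None, None, 2, 1]]

# One chordal ordering of the missing edges, each with the unique
# maximal clique containing the new edge (0-based indices).
STEPS = [(0, 2, [0, 1, 2]), (1, 3, [1, 2, 3]), (0, 3, [0, 1, 2, 3])]

def complete(a, steps):
    m = [row[:] for row in a]
    for p, q, clique in steps:
        fill_entry(m, p, q, clique)
    return m

M = complete(A, STEPS)
print(M[0][2], M[1][3], M[0][3], det(M))
```

For this example the determinant of the completed matrix agrees with the value 4/9 given by the right hand side of (3.1).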

4. Determinantal Inequalities

There are several attractive classical
inequalities for the determinant of a
positive definite n-by-n matrix $A = (a_{ij})$.
For example, Hadamard's inequality (1893)
states that

$$\det A \le \prod_{i=1}^{n} a_{ii}.$$

Fischer's generalization (1908) is that

$$\det A \le \det A[\alpha] \, \det A[\alpha^c]$$

in which $\alpha \subseteq \{1, 2, \ldots, n\}$ is any index set
and $\alpha^c$ is its complement.
The further generalization

$$\det A[\alpha \cup \beta] \le \frac{\det A[\alpha] \, \det A[\beta]}{\det A[\alpha \cap \beta]},$$

often called "Hadamard-Fischer", has also
been known for some time.
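
A small numerical check of the three inequalities may help fix the notation. The sketch below uses an arbitrary 3-by-3 positive definite matrix chosen for illustration (it does not appear in the paper), with 0-based index sets; `det` and `minor` are hypothetical helper names.

```python
from fractions import Fraction

def det(m):
    """Determinant by cofactor expansion (adequate for small matrices)."""
    if not m:
        return 1
    return sum((-1) ** j * m[0][j] * det([row[:j] + row[j + 1:] for row in m[1:]])
               for j in range(len(m)))

def minor(a, idx):
    """Principal minor det A[idx]."""
    return det([[a[i][j] for j in idx] for i in idx])

# Illustrative positive definite matrix (tridiagonal 2, 1).
A = [[2, 1, 0],
     [1, 2, 1],
     [0, 1, 2]]

full = minor(A, [0, 1, 2])

# Hadamard: det A <= product of the diagonal entries.
hadamard_bound = A[0][0] * A[1][1] * A[2][2]

# Fischer with alpha = {0}: det A <= det A[alpha] det A[complement].
fischer_bound = minor(A, [0]) * minor(A, [1, 2])

# Hadamard-Fischer with alpha = {0, 1}, beta = {1, 2}.
hf_bound = Fraction(minor(A, [0, 1]) * minor(A, [1, 2]), minor(A, [1]))

print(full, hadamard_bound, fischer_bound, hf_bound)
```

In this instance the bounds tighten from Hadamard to Fischer to Hadamard-Fischer, although that ordering is a feature of the example rather than a general claim.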

During the 1960's and 1970's a variety of
additional generalizations due to Carlson,
Fan and Marcus appeared. Each of these also
involves a right hand side that is a ratio of
principal minors of A. It is clear from
theorem 4 that for any chordal graph the
right hand side of (3.1) gives such an
inequality for A. (As noted, the
connectedness assumption upon the graph
may easily be relaxed to allow inequalities
such as Hadamard's as special cases.) In
fact, essentially all such ratio inequalities
may be deduced from these "chordal"
inequalities [JB].

1. INTRODUCTION

It is easy to imagine situations where one
has to estimate the contour of a region from
partial knowledge about that region. For example,
in mining exploration a geologist wants to
estimate the location of an ore deposit
from observations made at some points.

Ripley and Rasson (1977) consider a problem
of that type posed by Professor D.G. Kendall. The
situation is the following: given a realization of
a homogeneous Poisson process of unknown intensity
within an unknown compact convex set $C \subset \mathbb{R}^2$, we
want to estimate C. Conditionally on the fact
that the number of observed points N is n, these
points $x_1, \ldots, x_n$ are independent and uniformly
distributed on C. The arguments used by Ripley
and Rasson are conditional on the value of N.
The proposed reconstruction consists of a dilation
of the convex hull of $x_1, \ldots, x_n$ about its
centroid, this dilation being such that the area
of the reconstruction is (approximately) an
unbiased estimate of the area of C. The
procedure is affine invariant.

In the situation where one has to reconstruct
an interval on the line the criterion to evaluate
a procedure is clear. Indeed, if the center and
the length are correctly estimated the reconstruction
is good. For a set in $\mathbb{R}^2$ the situation is
not so simple because here there is a new element,
the shape of the set. The shape cannot in general
be specified by a finite-dimensional parameter, so
a criterion for assessing the estimation of the
shape, one that would lead to some workable procedures,
is not easy to find. Moore (1984) proposes
to measure the precision of a reconstruction
$\hat{C}$ by $m[C \,\triangle\, \hat{C}]$, the measure of the symmetric
difference between C and $\hat{C}$. There exists a
complete class of solutions (reconstruction rules)
with respect to this loss function; however, these
solutions will in general be difficult to obtain.

In the problem considered by Ripley and
Rasson the observations come only from the
interior of C. In many situations information
coming from outside of C will also be available
(e.g. in the search for an ore deposit some
observations will fall outside of the deposit).
We formulate here a problem that
incorporates this type of situation. For the
problem considered a minimal sufficient statistic
to reconstruct C is known (section 2). It is
however difficult to find reconstruction rules,
based on this statistic, which satisfy a pertinent
criterion. This is briefly explored in section 3.

An alternative approach consists in formulating
reconstruction algorithms based on the minimal
sufficient statistic, and evaluating them with
regard to some criteria. Three such algorithms
are presented in section 4. Some results of a
simulation experiment designed to compare these
three algorithms are reported in section 5.

2. THE PROBLEM

Let C be an unknown compact convex set in $\mathbb{R}^2$.
Suppose the sample points $x_1, \ldots, x_n$ (n is given
but can be the value taken by a random variable)
are selected independently according to a known
distribution function F on $\mathbb{R}^2$ whose support
includes C. For each sample point it is known,
in addition to its coordinates, whether it is interior
or exterior to C. Based on this information it
is desired to reconstruct (estimate) C. Other
sampling models could be considered; see De Groot
and Eddy (1983).

The sample space is

$$S = \{(x_1, i_1, \ldots, x_n, i_n) : x_j \in \mathbb{R}^2,\ i_j = 0 \text{ or } 1,\ j = 1, \ldots, n\}$$

where $i_j = 1$ if the jth sample point is interior
to C and $i_j = 0$ otherwise. Let H be the closed
convex hull of the interior sample points and let
V be the set of the vertices of H. Clearly H is
a lower bound for C. An upper bound for C is given
by the union, K, of all the closed convex sets Q
such that $H \subseteq Q$ and for which all the $x_j$ with
$i_j = 0$ are exterior to Q. De Groot and Eddy
(1983) prove that K is star-shaped from the set
H, that is if $y \in K$, $z \in H$ and $u = \alpha y + (1-\alpha) z$,
$0 \le \alpha \le 1$, then $u \in K$. The set K can be
constructed by noting that the complement of K is

$$K^c = \bigcup_{j \in E} \{y : y = x_j + \lambda (x_j - z),\ z \in H,\ \lambda \ge 0\}$$

where $E = \{j : i_j = 0,\ 1 \le j \le n\}$. Figure 1 illustrates
the sets H and K. The unknown convex
set C is such that $H \subseteq C \subseteq K$ (with probability
one). Let T be the set of peaks of K, a peak
being a sample point $x_j$, $j \in E$, such that if $x_j$
is removed then K is modified. Hachtel, Meilijson
and Nadas (1981), and also De Groot and Eddy
(1983) in a more general setting, have shown that
(V,T) is a minimal sufficient statistic for the
family $\{P_C : C \in \mathcal{C}\}$, $\mathcal{C}$ being a class of compact
convex sets included in the support of F and
$P_C$ the probability measure induced on S by F
given C. A reconstruction rule for C should be
based on (V,T). It seems however difficult to find
such a rule that would be easy to implement and
that would satisfy an attractive criterion. This
is briefly considered in the next section.
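
The description of the complement of K gives a direct membership test: y lies outside K exactly when some exterior sample point lies on a segment joining y to a point of H, that is, when some exterior point belongs to the convex hull of H together with y. The following is a minimal sketch of that test, assuming H has nonempty interior and using a standard monotone-chain convex hull; the function names and the sample data are illustrative only.

```python
def cross(o, a, b):
    """Twice the signed area of the triangle (o, a, b)."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def convex_hull(points):
    """Andrew's monotone chain; vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

def in_closed_polygon(poly, p):
    """True if p lies in the closed convex polygon poly (CCW vertices)."""
    n = len(poly)
    return all(cross(poly[i], poly[(i + 1) % n], p) >= 0 for i in range(n))

def in_K(y, interior_pts, exterior_pts):
    """y is in K iff no exterior point lies in conv(H union {y})."""
    region = convex_hull(list(interior_pts) + [y])
    return not any(in_closed_polygon(region, x) for x in exterior_pts)

# Hypothetical data: H is the square [0,2] x [0,2], one exterior point.
interior = [(0, 0), (2, 0), (2, 2), (0, 2)]
exterior = [(4, 1)]

print(in_K((3, 1), interior, exterior))  # between H and the exterior point
print(in_K((5, 1), interior, exterior))  # in the "shadow" behind it
```

The vertex set V is simply `convex_hull(interior)`; finding the peaks T would require, in addition, testing which exterior points change K when removed.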

Figure 1: The set H is generated by the interior
points (+); the set K is generated by
the exterior points (•); the peaks are
circled.


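3. RECONSTRUCTION RULES

The distribution of the random vector
$(x_1, i_1, \ldots, x_n, i_n)$ given C is

$$\prod_{j=1}^{n} dF(x_j)\, I_{C(j)}(x_j) \qquad (1)$$

where $C(j) = C$ if $i_j = 1$, $C(j)$ is the complement
of C if $i_j = 0$,
and $I_A$ is the indicator function of the set A.

From (1) it is clear that given the observations
$(x_1, i_1, \ldots, x_n, i_n)$ the maximum likelihood
estimator of C is any set $\hat{C}$ such that $H \subseteq \hat{C} \subseteq K$.
So the m.l.e. leaves much to be chosen.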

As mentioned earlier, a natural measure for
the accuracy of a reconstruction $\hat{C}$ is $m[C \,\triangle\, \hat{C}]$
or preferably $m[C \,\triangle\, \hat{C}]/m[C]$. Given a reconstruction
rule $\delta$, that is a function from S (more
precisely $\{(V,T)\}$) to the class of the convex sets
considered, a criterion to assess $\delta$ could then be

$$R(C, \delta) = E\big(m[C \,\triangle\, \delta(V,T)]/m[C]\big)$$

the expectation being with respect to the distribution
defined by (1). A good reconstruction rule
would be one for which $R(C, \delta)$ is minimal for a
large class of sets C, or one for which the
maximum value of $R(C, \delta)$ over a large class of sets
C is minimal. Unfortunately, except for very
restricted classes of convex sets, it will be very
difficult to obtain such procedures explicitly.
However, for larger classes it is sometimes
possible to show that such a procedure exists.

Let $\Lambda$ be a (probability) measure, considered
as a prior, defined on the class $\mathcal{C}$ of the convex
sets considered. The posterior distribution on $\mathcal{C}$
is then given by

$$G(C \mid x_1, i_1, \ldots, x_n, i_n) = \begin{cases} \dfrac{\Lambda(C)}{\int_{\mathcal{C}(V,T)} d\Lambda(D)} & \text{if } C \in \mathcal{C}(V,T) \\[2ex] 0 & \text{otherwise} \end{cases} \qquad (2)$$

where $\mathcal{C}(V,T)$ is the class of sets $C \in \mathcal{C}$ which are
compatible with the observations (see De Groot and
Eddy (1983)). To illustrate the reconstruction
given by the mode of the posterior distribution
(which maximizes $\Lambda(C)$ on $\mathcal{C}(V,T)$), we consider the
following example. Let $\mathcal{C}$ be the class of
rectangles with sides parallel to given axes.
This class can be described by the parameters
$(t, u, r_1, r_2)$ where (t, u) are the coordinates
of the center and $r_1$, $r_2$ the half lengths
of the sides. As a prior we consider the measure
reflecting ignorance. Following Villegas (1977)
this measure (the inner prior) is such that

$$\Lambda(t, u, r_1, r_2) \propto 1/(r_1^2 r_2^2) \qquad (3)$$

(or $1/(r_1 r_2)$, the outer prior, if prior independence
is assumed for the center and the lengths,
but here the final procedure will be the same).
The rectangle $C \in \mathcal{C}$ maximizing (2) is the rectangle
compatible with the observations and such that
$r_1 r_2$ is minimal, that is the smallest rectangle
including all the interior sample points. It is
interesting to note that the noninformative prior
leads to a reconstruction using only the information
provided by the interior sample points.
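
For the rectangle example, the posterior mode is therefore the bounding box of the interior sample points. A minimal sketch follows, returning the parameters (t, u, r_1, r_2) used in the text; the function name and the sample points are hypothetical.

```python
def smallest_rectangle(interior_points):
    """Smallest axis-parallel rectangle containing the interior points.

    Returns (t, u, r1, r2): center coordinates and half lengths of the sides.
    """
    xs = [p[0] for p in interior_points]
    ys = [p[1] for p in interior_points]
    t = (min(xs) + max(xs)) / 2
    u = (min(ys) + max(ys)) / 2
    r1 = (max(xs) - min(xs)) / 2
    r2 = (max(ys) - min(ys)) / 2
    return t, u, r1, r2

# Hypothetical interior sample points.
print(smallest_rectangle([(1, 1), (3, 2), (2, 5)]))
```

The exterior points play no role in the result, which matches the remark above that this prior uses only the interior sample points.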

In some circumstances it might be possible to
estimate C by the expectation of the posterior
distribution (2). Hachtel, Meilijson and Nadas
(1981) briefly consider this possibility.

4. ALGORITHMS

Since it is in general difficult to derive a
reconstruction rule from a general criterion, we
consider some empirical algorithms. They are all
based on the statistic (V,T) and are presented in
order of increasing complexity.

Algorithm I (AI): The centroid o of H is determined
and the maximal dilation factor, $d_m$, of H about
its centroid, permitted by T, is determined. To
find $d_m$ we consider for each $t \in T$ the intersection
$u_t$ of the line ot with the frontier of H,
then

$$d_m = \min_{t \in T} \frac{|ot|}{|ou_t|}.$$

The proposed reconstruction is

$$\hat{C} = [1 + \alpha(d_m - 1)]H, \quad 0 \le \alpha \le 1$$

that is the dilation of H about its centroid by a
factor $[1 + \alpha(d_m - 1)]$. The parameter $\alpha$ is chosen
by the user: $\alpha = 0$ corresponds to estimating C by
H and $\alpha = 1$ corresponds to the maximal dilation
permitted. To reflect ignorance the value 1/2
could be assigned to $\alpha$; also the choice of $\alpha$ could
be dependent on the data. Clearly $\hat{C}$ is convex,
$H \subseteq \hat{C} \subseteq K$, and the procedure is affine invariant.
Figure 2 illustrates the procedure ($\alpha = 1/2$).
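
Algorithm I can be sketched in a few lines. The code below is an illustration under assumptions that the text does not spell out: the centroid is taken to be the area centroid of the polygon H, and the maximal dilation factor is taken to be the smallest ratio |ot|/|ou_t| over the peaks t, where u_t is the point at which the ray from o towards t leaves H. The function names and the data are hypothetical.

```python
import math

def polygon_centroid(poly):
    """Area centroid of a simple polygon with counter-clockwise vertices."""
    a = cx = cy = 0.0
    n = len(poly)
    for i in range(n):
        (x0, y0), (x1, y1) = poly[i], poly[(i + 1) % n]
        w = x0 * y1 - x1 * y0
        a += w
        cx += (x0 + x1) * w
        cy += (y0 + y1) * w
    return (cx / (3 * a), cy / (3 * a))  # a is twice the signed area

def max_dilation(poly, o, peaks):
    """Smallest ratio |ot| / |ou_t| over the peaks (assumed form of d_m)."""
    n = len(poly)
    best = math.inf
    for t in peaks:
        dx, dy = t[0] - o[0], t[1] - o[1]
        s_max = math.inf  # largest s with o + s (t - o) still in H
        for i in range(n):
            p, q = poly[i], poly[(i + 1) % n]
            ex, ey = q[0] - p[0], q[1] - p[1]
            c0 = ex * (o[1] - p[1]) - ey * (o[0] - p[0])  # > 0: o inside
            slope = ex * dy - ey * dx
            if slope < 0:  # the ray moves towards this side
                s_max = min(s_max, c0 / -slope)
        best = min(best, 1.0 / s_max)
    return best

def algorithm_one(hull, peaks, alpha=0.5):
    """Dilate H about its centroid by the factor 1 + alpha (d_m - 1)."""
    o = polygon_centroid(hull)
    factor = 1 + alpha * (max_dilation(hull, o, peaks) - 1)
    return [(o[0] + factor * (x - o[0]), o[1] + factor * (y - o[1]))
            for x, y in hull]

# Hypothetical data: H is the square [0,2] x [0,2] with two peaks.
H = [(0, 0), (2, 0), (2, 2), (0, 2)]
T = [(4, 1), (1, 5)]
d_m = max_dilation(H, polygon_centroid(H), T)
C_hat = algorithm_one(H, T, alpha=0.5)
print(d_m, C_hat)
```

With these data the peak (4, 1) is the binding one, so the square is dilated by a factor of about 2 when alpha = 1/2.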
Algorithm II (AII): With AI the dilation is the
same in all directions (isotropic). The information
supplied by T may indicate some directions
for which the dilation could be more important.
AII takes this fact into account. Let $h_1, \ldots, h_s$
be the sides of H and $d_{m1}, \ldots, d_{ms}$ the maximal
dilation factors permitted for each side:

$$d_{mj} = \inf\left\{ \frac{|ou_K|}{|ou|} : u \in h_j,\ u_K = \text{intersection of the line } ou \text{ with the frontier of } K \right\}, \quad j = 1, \ldots, s.$$

Each side is dilated to become

$$h_j' = [1 + \alpha_j (d_{mj} - 1)]\, h_j, \quad 0 \le \alpha_j \le 1, \quad j = 1, \ldots, s$$

(if there is not sufficient information to
determine the dilation factor for a side, then the
dilation factor obtained for a neighboring side is
used). The proposed reconstruction is the convex
polygon obtained by extending the $h_j'$s until they
meet (some $h_j'$ may be eliminated because they are
too far). The role of the $\alpha_j$s is analogous to the
one played by $\alpha$ for AI. By construction $\hat{C}$ is
convex and $H \subseteq \hat{C} \subseteq K$. Since AII uses only dilations
by factors that are invariant under affine
transformations, the procedure is affine invariant.
Figure 3 illustrates AII (all $\alpha_j = 1/2$) applied to
the data in Figure 2; note that hg is eliminated
in the reconstruction of C.

Algorithm III (AIII): With AI and AII, V is
essentially used to determine the shape of $\hat{C}$
and T is used to fix its size. We may think this
gives too much weight to V in the utilization of
(V,T). The third algorithm, which is more
complex, considers two preliminary estimates of C,
one being simply H and the second mainly obtained
from T. The final estimate is the average
(in the Minkowski sense) of these two estimates. The
hope here is that the information contained in T
will be used more completely. We describe step by
step (as in the program for the simulation
study) the procedure to obtain $\hat{C}$. Only the main
elements are given; some complementary details,
mainly about steps 5 and 8, can be found in a
technical report available from the first author.
Figure 4 illustrates AIII applied to the data in
Figure 2.

Figure 3: Algorithm II applied to the data used
in Figure 2.

STEP 1. Draw a frame $\Omega$, that is a rectangle
including all the sample points.

STEP 2. Let |T| be the cardinality of T. If
|T| = 0 there is no exterior point; we could then
use, for example, the Ripley-Rasson procedure. If
|T| = 1 or 2, points are possibly added to T;
these are the vertices of $\Omega$ not included in K
(if there are some). In the description of this
algorithm T will denote the set of peaks augmented
as just described. It is easy to see that T will
contain at least two points.

STEP 3. Find the convex hull, W, of the points in
T.

STEP 4. Determine if $H \subset W$ or not. To do so we
find the number v of vertices of H which are
interior to W. If v = |V| then $H \subset W$ and we go
to step 6 (this is the situation in Figure 4). If
v = 0 we determine if $H \cap W \neq \emptyset$ or not. If this
intersection is empty (this is not possible if T
has been augmented in step 2) we add points to T
as described in step 2 and find W from that
augmented set; then $H \cap W \neq \emptyset$. If this new W
includes H we go to step 6.

STEP 5. If 0 < v < |V| or if W is obtained in
step 4 and does not include H, the set W is
enlarged to produce a convex set containing H and having all
its vertices in K or on its frontier.
STEP 6. Let W' be the set W if v = |V| or the
set obtained after step 4 or 5. Consider the set
$B = \frac{1}{2}(H \oplus W')$ where $\oplus$ denotes the Minkowski
addition and the 1/2 contraction is with respect
to the centroid of H. The set $H \oplus W'$ is obtained
by finding the convex hull of all the points in

$$\{(a_i + b_j) : a_i \in V,\ b_j \in T',\ i = 1, \ldots, |V|,\ j = 1, \ldots, |T'|\}$$

where T' is the set of vertices of W'. The convex
set B includes H and has all its vertices in K;
however some sides of B may cross K.

STEP 7. Find if some sides of B cross K. To do
so we determine the number u of elements in T
which are interior to B. If u = 0 the procedure
is terminated and the final reconstruction is
$\hat{C} = B$ (this is the case in Figure 4).

STEP 8. If $u \ge 1$, B is reduced to form a convex
set B' including H and included in K. The
final reconstruction is $\hat{C} = B'$.
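
The Minkowski average used in step 6 can be sketched as follows. This is an illustration, not the program used for the simulation study: it reads the 1/2 contraction about the centroid of H as taking that centroid as origin, so that B is the convex hull of the midpoints (a_i + b_j)/2, and it uses a standard monotone-chain hull. The function names and the two squares are hypothetical.

```python
def cross(o, a, b):
    """Twice the signed area of the triangle (o, a, b)."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def convex_hull(points):
    """Andrew's monotone chain; vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

def minkowski_average(v_vertices, w_vertices):
    """Convex hull of all midpoints (a + b) / 2, a in V, b a vertex of W'."""
    mids = [((a[0] + b[0]) / 2, (a[1] + b[1]) / 2)
            for a in v_vertices for b in w_vertices]
    return convex_hull(mids)

# Hypothetical data: H is the square [0,2]^2 and W' the square [-2,4]^2.
V = [(0, 0), (2, 0), (2, 2), (0, 2)]
W_PRIME = [(-2, -2), (4, -2), (4, 4), (-2, 4)]
B = minkowski_average(V, W_PRIME)
print(B)
```

Because W' contains H in this example, the average B lies between the two sets and contains H, as stated in step 6.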


Figure  4:  Algorithm  III  applied  to  the  data  used 
in  Figure  2. 

Because the frame $\Omega$ is introduced, AIII does
not really use only the information provided by
the minimal sufficient statistic (V,T). However,
we want to note that in many cases (if the sample
size is large enough) AIII will go through steps
3, 4, 6, 7 without difficulties (e.g. Figure 4),
and then the introduction of $\Omega$ (step 1) is not
necessary. Also, it may happen that a set $\Omega$ is
already available; it will be the case for example
if the support of F is finite (see next section),
or if it is known a priori that C is in a given
bounded region; see Moore and Lanlel (1983) for an
example in soil studies.


5. A SIMULATION STUDY

To compare the algorithms presented in
section 4 a simulation experiment was conducted.
The structure of this experiment is the following.

A frame $\Omega$ is fixed; this is a 10 × 10 square
(it will be the one used in step 1 for AIII). A
polygon is drawn in $\Omega$; this polygon is considered
as the unknown C. The sample points considered
are the realizations of a homogeneous planar
Poisson process of intensity $\lambda$ observed on $\Omega$. To
simulate these data a number n is generated
from the Poisson distribution with parameter $100\lambda$
and then n points are uniformly and independently
generated on $\Omega$. To simulate from the Poisson
distribution we used the algorithm 3.3 in Ripley
(1987) and to simulate from the uniform we used a
procedure given by Bratley, Fox and Schrage (1983,
p. 202). Each sample point is classified as
interior or exterior to C. To evaluate the
quality of a reconstruction two criteria are
considered, a precision criterion: $m[C \,\triangle\, \hat{C}]/m[C]$,
and a recovering criterion: $m[C \cap \hat{C}]/m[C]$. It is
to be remarked that

$$m[C \,\triangle\, \hat{C}] = m[C] + m[\hat{C}] - 2\,m[C \cap \hat{C}]$$

so $m[C \,\triangle\, \hat{C}]$ can be large and still $m[C \cap \hat{C}]$
approximately equal to m[C], i.e. C is almost
recovered but with a much larger set than
necessary.
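
The data-generation step can be sketched as follows. This is a minimal illustration, not the generators cited above: the Poisson variate is drawn with Knuth's multiplication method rather than algorithm 3.3 of Ripley (1987), the uniform variates come from Python's standard `random` module, and the polygon C, the seed and the function names are hypothetical.

```python
import math
import random

def cross(o, a, b):
    """Twice the signed area of the triangle (o, a, b)."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def poisson(mean, rng):
    """Poisson variate by Knuth's multiplication method (small means only)."""
    limit = math.exp(-mean)
    k, prod = 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k

def inside_convex(poly, p):
    """True if p lies in the closed convex polygon poly (CCW vertices)."""
    n = len(poly)
    return all(cross(poly[i], poly[(i + 1) % n], p) >= 0 for i in range(n))

def simulate(lam, polygon, side=10.0, seed=0):
    """Sample a homogeneous Poisson process on the square frame and
    label each point 1 (interior to C) or 0 (exterior to C)."""
    rng = random.Random(seed)
    n = poisson(lam * side * side, rng)
    pts = [(rng.uniform(0, side), rng.uniform(0, side)) for _ in range(n)]
    labels = [1 if inside_convex(polygon, p) else 0 for p in pts]
    return pts, labels

# Hypothetical C: a 5 x 5 square, i.e. 25% of the 10 x 10 frame.
C = [(2.5, 2.5), (7.5, 2.5), (7.5, 7.5), (2.5, 7.5)]
points, labels = simulate(1.0, C)
print(len(points), sum(labels))
```

With intensity 1.0 the expected number of points is 100, of which about a quarter should fall inside this C.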

In a first stage we tried to determine the
main factors influencing the results. Three
factors were considered: the value of $\lambda$, the
proportion of $\Omega$ occupied by C and the number
of sides of C. A $2^3$ factorial experiment was
conducted; the selected levels were $\lambda$: 0.25, 1;
proportion: 25%, 75%; number of sides: 4, 12. For
each of the 8 experimental conditions 250 independent
repetitions were made. With respect to both
quality criteria it was observed that the first
two factors are much more important than the third
and that all the interactions are negligible.

To compare the three algorithms it was
decided, from the above results, to consider a
polygon with six sides for C and to use three
proportions: 25%, 50%, 75% and three values for
$\lambda$: 0.25, 0.6, 1.0.

For each of the 27 combinations algorithm -
proportion - intensity 250 repetitions were made,
the 27 × 250 repetitions being independent.
Tables 1, for $m[C \,\triangle\, \hat{C}]/m[C]$, and 2, for
$m[C \cap \hat{C}]/m[C]$, present the average of the 250
results for each of the 27 situations. The number
in parentheses is $2s/\sqrt{250}$, $s^2$ being the sample
variance.

Table 1

Average of the 250 values of $m[C \,\triangle\, \hat{C}]/m[C]$

                    Algorithm I                            Algorithm II                           Algorithm III
Prop.   λ = .25      λ = .60      λ = 1.0      λ = .25      λ = .60      λ = 1.0      λ = .25      λ = .60      λ = 1.0
25%     .582 (.022)  .352 (.015)  .261 (.011)  .514 (.022)  .293 (.014)  .204 (.008)  .335 (.011)  .281 (.010)  .215 (.008)
50%     .406 (.018)  .230 (.009)  .162 (.006)  .349 (.017)  .168 (.007)  .123 (.005)  .203 (.008)  .134 (.004)  .109 (.003)
75%     .335 (.014)  .172 (.006)  .124 (.005)  .262 (.013)  .133 (.006)  .091 (.004)  .155 (.007)  .090 (.003)  .065 (.002)

Table 2

Average of the 250 values of $m[C \cap \hat{C}]/m[C]$

                    Algorithm I                            Algorithm II                           Algorithm III
Prop.   λ = .25      λ = .60      λ = 1.0      λ = .25      λ = .60      λ = 1.0      λ = .25      λ = .60      λ = 1.0
25%     .434 (.023)  .658 (.015)  .746 (.011)  .550 (.026)  .751 (.015)  .826 (.009)  .877 (.011)  .948 (.005)  .953 (.004)
50%     .604 (.018)  .775 (.010)  .841 (.006)  .676 (.019)  .255 (.007)  .895 (.005)  .866 (.011)  .944 (.004)  .964 (.003)
75%     .668 (.015)  .832 (.007)  .878 (.005)  .744 (.013)  .879 (.006)  .920 (.004)  .868 (.009)  .940 (.004)  .965 (.002)

Table 1 reveals that:

• For a given proportion, the precision increases
with $\lambda$ (i.e. when the average number of sample
points increases). The relative augmentations
are more important with AI and AII.

• For a given $\lambda$, the precision increases with the
proportion. When the proportion is augmented
and $\lambda$ kept fixed, a larger proportion of the
sample points is interior to C. Then H is a
better approximation to C (better estimation of
the shape). Also, since it is known here that
C is in $\Omega$, the precision gained from inside is
not canceled by the diminution of the number of
sample points outside C. The fact that AIII
may use $\Omega$ explains the relatively more
important increase there.

• The comparison of the algorithms by pairs, for
a given proportion and a given $\lambda$, indicates
that AIII always does better than AI with some
important differences; AII is always better
than AI; AIII is never inferior to AII and when
$\lambda$ is small AIII is much better than AII.

• There is an important variation among the
sample variances. The variability is more
important when $\lambda$ or the proportion is small.
The situation is similar for AI and AII but
AIII appears to be more stable.

Concerning the recovering criterion, from Table 2
we observe:

• With AI and AII, again better results are
obtained when a larger proportion of the sample
points are interior to C. However, AIII seems
more stable in that regard.

• The comparison of the algorithms by pairs shows
that AII is always better than AI and AIII
always better than AII. When $\lambda$ or the
proportion is small AIII does much better.

• The remarks made about the variability in
regard to the precision criterion also apply
here.

From Table 1 and Table 2 it is easy to obtain
the average of the ratio $m[\hat{C}]/m[C]$ which indicates
how accurately $m[\hat{C}]$ estimates m[C]. We observe
that AI and AII underestimate m[C]. This suggests
that it could have been advantageous to take the
$\alpha$'s larger than 1/2, mainly when the number of
sample points is small.

To  see  how  each  of  the  factors:  algorithm, 
proportion  and  intensity,  contributes  to  explain 
the  variation  among  the  results,  the  ANOVA  tables 
corresponding  to  a  three-way  layout  model  were 
computed  and  then  the  percentages  of  variation 
explained  by  each  factor  and  the  interactions  were 
obtained  (Table  3). 

Table 3

Variation explained by each factor

Criterion                        Algorithm (A)  Proportion (P)  Intensity (I)  A×P   A×I   P×I   A×P×I  ERROR
$m[C \,\triangle\, \hat{C}]/m[C]$     10.5           24.0            29.5           0.0   3.0   1.0   0.0    32.0
$m[C \cap \hat{C}]/m[C]$              28.5           7.0             25.0           4.0   3.0   0.0   0.5    32.0

We remark that for each criterion the factors
considered leave a large part of the variation
unexplained (error). Due to the geometrical
character of the problem it seems difficult to
determine factors that would be easy to formulate
and that would explain a larger part of the
variation. Indeed, we have noticed that for a
given C, a given $\lambda$, and a given algorithm, the
results obtained for different samples were often
very different, this being simply due to the
different position of the sample points; however,
this variation was less important with AIII.

6. CONCLUSION

To  reconstruct  a  convex  set  C  from  sample 
points,  some  being  interior  and  the  others 
exterior  to  C,  a  minimal  sufficient  statistic  is 
known.  However,  a  reconstruction  rule  based  on 
this  statistic,  giving  an  optimal  reconstruction 
relatively  to  an  appealing  criterion,  is  not  in 
general  easy  to  find.  It  is  possible  to  formulate 
algorithms  that  are  easy  to  apply  and  have 
acceptable  performances.  We  have  proposed  three 
such  algorithms.  They  are  the  result  of  many 
trials  and  partly  motivated  by  the  desire  to 
consider  simple  methods.  Clearly  many  other 
suggestions  could  be  made. 

When a reconstruction is evaluated two points
of view can be adopted. We may be satisfied if we
recover C, that is if $m[C \cap \hat{C}]/m[C]$ is near one,
or we may be more severe and want $\hat{C}$ to be C,
that is we want $m[C \,\triangle\, \hat{C}]$ near zero. In the first
situation the choice of the algorithm is important;
among those considered AIII gives good
results independently of the sample size and of
the relative size of C. In the second situation,
since we demand much more, it seems that the
sample size plays the dominant role. However, AIII
gives acceptable results even for moderate sample
sizes. When the sample size is large it takes
much more time to apply AIII than it takes
to apply AII. One may then think that the gain
is not justified.

APPLICATIONS OF ORTHOGONALIZATION PROCEDURES TO FITTING TREE-STRUCTURED MODELS
Cynthia O. Siu, Johns Hopkins University


Orthogonalization is an important concept in
computations for linear models. In this paper,
applications of Givens rotations and Modified
Gram-Schmidt orthogonalizations to tree-
structured regressions are discussed. The
resulting procedure generalizes CART's piecewise
constant tree model to piecewise linear models.
Great versatility is offered by this approach;
regression tree models for quantitative and
binary data can be handled by one general fitting
procedure. In addition, it provides a basis for
implementing various linear and tree-structured
regression methods under one framework.

1.  INTRODUCTION 

Breiman  et  al.  (1984)  and  Friedman  (1979) 
described  a  tree -structured  approach  to  non- 
parametric  multiple  regression.  Their  methods  use 
a  hierarchy  of  piecewise  constant  functions  or 
piecewise  linear  functions  to  approximate  the 
regression  surface.  For  these  tree  models,  the 
predictor  space  X  is  partitioned  recursively  into 
rectangular  subregions,  perpendicular  to  the 
original  coordinate  axes.  A  separate  constant  or 
linear  model  is  fit  to  the  subgroup  of  data 
points  lying  in  each  of  the  subregions  obtained. 

In the author's unpublished Ph.D. thesis at
the University of Toronto, we propose a different
tree-structured fitting procedure for piecewise
linear models (Siu and Andrews, 1985). The method
is based on a natural extension of the Modified
Gram-Schmidt orthogonalization process in the linear
least squares method. Unlike the model by Friedman
(1979), the hierarchy of piecewise linear models
is built by adding one predictor variable at a
time to the local models, linearly adjusted for
the effects of those already included.

Using this orthogonalization approach,
recursive partitioning is performed on residual
predictor variables rather than on original
variables. Data points are grouped according to
a) the nature of relationships among predictor
variables, and b) the relevance of these variables
to the response y. The resulting recursive
partitioning procedure is more general than the
rectangular splits in previous methods. Generally,
associations are found among predictor
variables in real data. In the special case of
having totally unrelated predictor variables, the
splits performed on residual variables will be
the same as the univariate rectangular splits.
This method provides a simple solution to
otherwise difficult problems, such as detecting
linear structures that are separated by hyperplanes
not perpendicular to the coordinate axes.

In addition, this orthogonalization approach
allows tree-structured models to be built and
interpreted within the familiar linear regression
framework. The usual selection procedures for
stepwise regressions can be used for choosing
variables and splitting values in this method. In
the absence of recursive partitioning operations,
this procedure is identical to fitting a specified
linear regression by a forward stepwise
approach.

This paper describes applications of orthogonalization procedures to fit tree-structured models. The analogy of this approach to classical least squares methods opens many possibilities to generalize the existing tree-structured methodology. Some of them will be discussed here. In particular, the framework can be used for developing tree-structured extensions of Generalized Linear Models (GLM) (Nelder and Wedderburn, 1972). As compared to Generalized Additive Models (Hastie and Tibshirani, 1985), this recursive partitioning approach uses a hierarchy of piecewise linear functions to generalize the linear predictor function in GLM.

Givens rotations provide the basic algorithm for computing these tree models. The proposed fitting algorithm is flexible. It can be organized to fit a wide class of parametric and nonparametric hierarchical models within one framework. This includes as special cases stepwise procedures for generalized linear models, as well as the standard recursive partitioning procedure for the piecewise constant model (Breiman et al., 1984) and the piecewise linear model (Friedman, 1979). A simple model specification rule is proposed. It helps clarify the properties and uses of these procedures.

In Section 2, we start with a simple example of the standard tree model to illustrate the basic ideas behind tree-structured methodology. Section 3 describes the linear adjustment approach to fit a tree-structured normal response model. This method differs from the one by Friedman (1979) in several important ways, which will also be discussed. Section 4 shows how this approach can be used to develop the tree-structured extension of generalized linear models. Section 5 describes the model specifications for the general class of hierarchical fitting procedures considered here.

2.  BINARY  REGRESSION  TREES 
In regression tree methods, data are recursively partitioned into smaller subgroups to build a hierarchy of piecewise models. A separate local model is fitted to the subgroup of data points $G_k$ lying in each subregion $S_k$ of the predictor space X (Figure 2.1).

[Figure 2.1: Tree representation, with the collection and size of nodes.]


Following the notation in Friedman (1979), each node k represents a triple $(S_k, G_k, L_k)$, where

$S_k$ = a subregion of predictor space X,

$G_k$ = a subgroup of data points lying in subregion $S_k$, and

$L_k$ = a local model to be applied to $G_k$.

Each tree model consists of layers of nested nodes, where a layer is defined as a collection of nodes in one level of a tree (Siu, 1983).

Figure 2.1 illustrates a simple example of the standard tree-structured model. Starting from node 1, each nonterminal node j is split into two child nodes recursively. For each predictor variable (subscripted v), and for each value (subscripted c) of this variable, the set of data points $G_k$ lying in subregion $S_k$ is divided into two parts, lying in $S_{l(k)}$ and $S_{r(k)}$, one to the left of this value and the other to the right:

3) an objective function to choose the optimal partition of parent node j.

Together, they form the basis of tree-structured methodology for regression problems. There are other important issues, such as the selection of an optimal-sized subtree. Different approaches have been proposed for this problem. They can be found in Sonquist and Morgan (1964), Breiman et al. (1984), Siu (1985) and Loh and Vanichsetakul (to appear in JASA).


The two linear functions

$$L_{l(j)}(X_k) = L_j(X_{k-1}) + s_{l(j)}\,1 + r_{l(j)}\,x_{v:(k-1)},$$
$$L_{r(j)}(X_k) = L_j(X_{k-1}) + s_{r(j)}\,1 + r_{r(j)}\,x_{v:(k-1)} \qquad \text{(EQN 3.2a)}$$

are fit to $G_{l(j)}$ and $G_{r(j)}$ respectively. Let $x_{(c),v:(k-1)}$ be the cth ordered value of the residual variable subscripted v. Then,

$$G_{l(j)} \text{ in } S_{l(j)} = \text{data in } G_j \text{ whose } x_{v:(k-1)} \le x_{(c),v:(k-1)},$$
and
$$G_{r(j)} \text{ in } S_{r(j)} = \text{data in } G_j \text{ whose } x_{v:(k-1)} > x_{(c),v:(k-1)}. \qquad \text{(EQN 3.2b)}$$


3.  FITTING  REGRESSION  TREE  BY  ORTHOGONALIZATION 
3.1  Review  of  Linear  Regression 

Mosteller and Tukey (1977) gave an illuminating discussion of using the linear adjustment approach to fit a specified regression by stages. The procedure is identical to the Modified Gram-Schmidt orthogonalization process in the least squares method. Specifically, the method proceeds by sequentially sweeping out the effect of each predictor variable, and iteratively orthogonalizing the response y and the remaining variables to the variable swept. Let $y_{:(k-1)}$ be the residual of response y orthogonalized to variable subset $X_{k-1}$. At step k (k = 1 to p), variable x subscripted k* is added to the model $L(X_{k-1})$ by two operations:


$$\text{FIT:}\quad L(X_k) = L(X_{k-1}) + r_{k,p+1}\,x_{k^*:(k-1)},$$
$$\text{SWEEP:}\quad x_{v:(k)} = x_{v:(k-1)} - r_{k,v}\,x_{k^*:(k-1)}, \qquad v \ne k^*. \qquad \text{(EQN 3.1)}$$


The added variable plot (Cook and Weisberg, 1982) for $x_{k^*}$ is shown in Figure 3.1 (dotted line).
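
The FIT and SWEEP operations of (EQN 3.1) can be sketched in a few lines of Python. This is only an illustrative sketch, not the author's implementation: the function name `stagewise_fit`, the use of NumPy, and the choice to carry the response as a separate vector (rather than as column p+1) are assumptions made here, and an intercept is assumed to be supplied as a column of ones.

```python
import numpy as np

def stagewise_fit(X, y, order):
    """Fit y on the columns of X one variable at a time (hypothetical helper).

    At each step the chosen residual predictor is used to adjust the
    residual response (FIT) and every not-yet-entered predictor (SWEEP).
    """
    Xr = X.astype(float).copy()   # residual predictors x_{v:(k-1)}
    yr = y.astype(float).copy()   # residual response   y_{:(k-1)}
    coefs, entered = [], []
    for k_star in order:
        xk = Xr[:, k_star].copy()
        denom = xk @ xk
        # FIT: regress the residual response on the residual predictor
        r_y = (xk @ yr) / denom
        yr -= r_y * xk
        coefs.append(r_y)
        entered.append(k_star)
        # SWEEP: orthogonalize each remaining predictor to the one just entered
        for v in range(Xr.shape[1]):
            if v not in entered:
                r_v = (xk @ Xr[:, v]) / denom
                Xr[:, v] -= r_v * xk
    return np.array(coefs), yr

# Small check on simulated data: the final residual should agree with an
# ordinary least squares fit on all columns.
rng = np.random.default_rng(0)
X_demo = np.column_stack([np.ones(30), rng.normal(size=(30, 3))])
y_demo = rng.normal(size=30)
coefs_demo, resid_demo = stagewise_fit(X_demo, y_demo, order=[0, 1, 2, 3])
beta_demo, *_ = np.linalg.lstsq(X_demo, y_demo, rcond=None)
```

The coefficients returned are those of the residual (orthogonalized) predictors, so they generally differ from the ordinary regression coefficients even though the fitted values and residuals coincide.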


The optimum splitting value (c*,v*) is selected after screening all possible cuts (c,v) of node j (c = 1 to size of subgroup $G_j$; and v = 1 to p). This systematic screening procedure is applied to each nonterminal node j, until the full sized tree model is obtained.

The sweeping operation in (EQN 3.1) is performed on variables $y_{:(k-1)}$ and $x_{v:(k-1)}$ ($v \ne k^*$) separately for subgroups $G_{2j}$ and $G_{2j+1}$. All the data points shown in Figure 3.1 represent subgroup $G_j$ lying in subregion $S_j$. The piecewise linear function of $x_{k^*:(k-1)}$ (solid line) represents the partial leverage residual plot for $x_k$ on $S_{2j}$ and $S_{2j+1}$.

3.3  "Outlier"  Detections 

Stratifying on $x_{v:(k-1)}$ in node j can isolate high leverage data points indicated by extreme values of $x_{v:(k-1)}$. Let $N_{\min}$ denote the minimum group size to fit the linear functions $L_{l(j)}$ or $L_{r(j)}$ in (EQN 3.2a). The first and last few cuts (c,v) of each variable $x_{v:(k-1)}$ (v = 1 to p) can be used to check the presence of "outliers" by fitting


3.2  Extensions  to  Tree-Structured  Regression

To apply this linear adjustment approach to fit a tree-structured piecewise linear model, the linear regressions of $y_{:(k-1)}$ and $x_{v:(k-1)}$ ($v \ne k^*$) on $x_{k^*:(k-1)}$ in (EQN 3.1) are replaced by piecewise linear functions.

In particular, in splitting node j, the cut at (c,v) is defined by fitting a piecewise linear function $(L_{l(j)}, L_{r(j)})$ on $S_j$. That is,


$$L_{l(j)}(X_{k-1}) = L_j(X_{k-1}) + s_{l(j)}\,1,$$
$$L_{r(j)}(X_{k-1}) = L_j(X_{k-1}) + s_{r(j)}\,1 \qquad \text{(EQN 3.3)}$$

for $c \le N_{\min}$ and $c > N_j - N_{\min}$ respectively.



3.4  Force-to-enter  Variables 

Within this framework, local models $L_j(X_{k-1})$ in (EQN 3.2a) can be extended to include q force-to-enter variables $Z_q$. Let $M_j(X_{k-1}, Z_q)$ be the local models on subregion $S_j$. Then from (EQN 3.2a), $M_{2j}(X_k, Z_q)$ and $M_{2j+1}(X_k, Z_q)$ can be derived from $L_j(X_{k-1})$ as follows.

$$M_{2j}(X_k, Z_q) = L_j(X_{k-1}) + s_{2j}\,1 + r_{2j}\,x_{k^*:(k-1)} + Z_{q:(k-1)}\,\mathbf{t}_{2j},$$
$$M_{2j+1}(X_k, Z_q) = L_j(X_{k-1}) + s_{2j+1}\,1 + r_{2j+1}\,x_{k^*:(k-1)} + Z_{q:(k-1)}\,\mathbf{t}_{2j+1}. \qquad \text{(EQN 3.4)}$$


In this model, stratifications are performed on residual variables $x_{v:(k-1)}$ (v = 1 to p-q) whose effects are to be removed, but not on variables $z_{u:(k-1)}$ (u = 1 to q) whose effects are to be estimated. $\mathbf{t}_{2j}$ represents the estimated effects of the force-to-enter variables $Z_q$ in subgroup $G_{2j}$, linearly adjusted for local effects of the k-1 variables $X_{k-1}$ selected from previous splits.

Using this formulation, a fixed set of coefficient parameters $\mathbf{t}$ for $Z_q$ will be estimated for each subgroup obtained. For this model, Siu (1985) proposed an objective function for choosing splits that balances the bias against the variance of the estimates $\mathbf{t}$.

This force-to-enter option can be particularly useful in some applications. Consider the problem of applying recursive partitioning regression to analyze prospective randomized studies: the resulting tree model would be difficult to interpret if stratification were also performed on treatment variables.

3.5  Interpretations 

The coefficient parameters in the hierarchy of piecewise linear models are interpretable, allowing graphical assessment of nonlinearity in the data. For example, one can plot the individual effect of $z_u$ (u = 1 to q) against $x_v$ (v = 1 to p) to assess interactions between these variables. The interaction plot shown in Figure 3.1 is obtained by plotting the estimated coefficient parameter of $z_u$ for individual i, versus $x_{i,v}$ (i = 1 to size of entire training sample, v = 1 to p).

Distributions of variables in each subgroup provide information on the compositions of these optimal partitions.


Fig. 3.1  True model is $y = x_1 x_2 z + e$

3.6  Friedman's  Model 

Friedman (1979) presented an interesting approach to build tree-structured piecewise linear models. Specifically, a global multiple regression

$$L_1(X_p) = r_0\,1 + r_1 x_1 + \dots + r_p x_p \qquad \text{(EQN 3.5a)}$$

is fit to $G_1$ in $S_1$, the entire training sample. The effects of subsequent splits are to modify coefficients of these p variables one at a time. That is, in splitting node j, the coefficient of $x_{v^*}$ is modified by fitting

$$L_{2j}(X_p) = L_j(X_p) + s_{2j}\,1 + r_{2j}\,x_{v^*}, \qquad \text{(EQN 3.5b)}$$
$$L_{2j+1}(X_p) = L_j(X_p) + s_{2j+1}\,1 + r_{2j+1}\,x_{v^*}$$

to

$$G_{2j} \text{ in } S_{2j} = \text{data in } G_j \text{ whose } x_{v^*} \le x_{(c^*),k^*}, \qquad \text{(EQN 3.5c)}$$
$$G_{2j+1} \text{ in } S_{2j+1} = \text{data points in } G_j \text{ whose } x_{v^*} > x_{(c^*),k^*}$$

respectively.

As shown in (EQNs 3.2 and 3.5), the key differences between the two tree growing algorithms are choices of $L_1$ and the orthogonalization process applied to the predictor variables X. These lead to some fundamental differences between the two methods: 1) the parent model $L_j$ in this method is not nested within the child models $L_{2j}$ or $L_{2j+1}$, 2) rectangular splits are used here to group data points, via stratification on the original (EQN 3.5b) instead of the residual (EQN 3.2b) variables, and 3) all of the predictor variables are used for stratification in this formulation.

The orthogonalization approach (EQN 3.2) is a natural extension of the least squares method. It provides a simple framework to develop tree-structured methods for estimating iterative weighted least squares models. Building local models $L_j$ in a forward stepwise manner has an added advantage: the score test in GLM provides the theoretical basis for the splitting rule developed in Section 4. From (EQN 3.5), such an extension does not seem to be feasible without major modifications of the fitting algorithm.

4.  GENERALIZED  REGRESSION  TREE  MODELS 
4.1  Brief  Review

A generalized linear model in Nelder and Wedderburn (1972) is defined by three components: the error structure given by a one-parameter exponential family of the response variable y, the linear predictor $L(X_p)$, and the link function $g(\mu_p)$ (i.e. $y = \mu_p + e$, and $g(\mu_p) = L(X_p) = r_0\,1 + r_1 x_1 + \dots + r_p x_p$).

For this model, Peduzzi et al. (1980) suggested using the score test (Rao, 1973) for stepwise selection of variables. Full iteration is required only when the selected variable $x_{k^*}$ enters the model $L(X_{k-1})$, but not for screening competing models. Pregibon (1982) showed that the score test in GLM can be computed using a normal linear model setup. In particular, the score statistic for $H_0: r_k = 0$ in $L(X_k)$ is given by the additional regression sum of squares due to $x_k$ in the weighted least squares regression of $y^*$ on $X_k$ (EQN 4.2). That is, for $e \sim N(0, V^{-1})$ and $V = \mathrm{var}(y)$,

$$y^* = L(X_{k-1}) + r_k\,x_{k:(k-1)} + e = L^1(X_k) + e \qquad \text{(EQN 4.2)}$$

with the weight V and "working response variable" $y^* = L(X_{k-1}) + V^{-1}(y - \mu_{k-1})$ evaluated at the mle $r_{k-1}$ from $L(X_{k-1})$.


4.2  Tree-structured  Extensions

This section discusses the use of the optimal tree-structured approach for one-parameter exponential family models. The generalizations discussed here exploit the close connections between the stepwise approaches to fit least squares regression (EQN 3.1) and the tree-structured normal response model (EQN 3.2a). The two procedures for adding variables to the models are identical except for the recursive partitioning operations that lead to the fitting of regression tree models.

As shown by Nelder and Wedderburn (1972), the maximum likelihood estimates of GLM can be obtained through the iterative weighted least squares method. Using the orthogonalization approach, the method discussed in Section 3 can be readily applied to build tree-structured exponential family models. Computation effort can be saved by using an objective function analogous to the score test in GLM to choose optimal partitions of node j.

Following the formulations in Section 3.2, the cut at (c,v) is defined by replacing (EQN 4.2) with a piecewise weighted least squares regression of $y^*_{:(k-1)}$ on $x_{k:(k-1)}$, fitting

$$L^1_{l(j)}(X_k) = L_j(X_{k-1}) + s_{l(j)}\,1 + r_{l(j)}\,x_{v:(k-1)} \qquad \text{(EQN 4.3)}$$
and
$$L^1_{r(j)}(X_k) = L_j(X_{k-1}) + s_{r(j)}\,1 + r_{r(j)}\,x_{v:(k-1)}$$

to $G_{l(j)}$ in $S_{l(j)}$ and $G_{r(j)}$ in $S_{r(j)}$ respectively. $G_{l(j)}$, $S_{l(j)}$, $G_{r(j)}$ and $S_{r(j)}$ are defined as in (EQN 3.2b). It is obvious that (EQN 3.2a) is a special case of (EQN 4.3), where $L^1_{l(j)}$ and $L^1_{r(j)}$ represent the one-step approximations to the mle of $L_{l(j)}$ and $L_{r(j)}$.

As in stepwise procedures for GLM, iterations will be performed to obtain the mle of the selected models, $L_{2j}(X_k)$ on $S_{2j}$ and $L_{2j+1}(X_k)$ on $S_{2j+1}$, where $X_k = \{x_{k^*}, X_{k-1}\}$. The optimum split at (c*,k*) is defined by the piecewise linear model $(L^1_{2j}, L^1_{2j+1})$ which yields the largest additional sum of squares due to $x_{v:(k-1)}$ (EQN 4.3). This method is identical to the usual stepwise procedure for GLM, if no splitting is performed.


4.3  "Outlier"  Detections

Using this framework, potential outliers at extreme values of $x_{v:(k-1)}$ may be detected by fitting the weighted least squares regressions of

$$L^1_{l(j)}(X_{k-1}) = L_j(X_{k-1}) + s_{l(j)}\,1,$$
$$L^1_{r(j)}(X_{k-1}) = L_j(X_{k-1}) + s_{r(j)}\,1 \qquad \text{(EQN 4.5)}$$

to $G_{l(j)}$ when $c \le N_{\min}$, and to $G_{r(j)}$ when $c > N_j - N_{\min}$.


4.4  Givens  Rotations 

The success of this computation intensive method depends on a reliable, efficient algorithm to compute and update the weighted least squares models. As the cut moves from (c,v) to (c+1,v), one data point is added to $L_{l(j)}$, then removed from $L_{r(j)}$. Givens rotations provide the basic computation method to update the piecewise linear model $(L_{l(j)}, L_{r(j)})$ in screening the candidate splits. The method is designed to identify extreme values and exclude them from further analysis. This may help improve the stability of the algorithm for deleting data points.
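
As a rough illustration of the row-update idea, the following Python sketch adds one data row to an upper-triangular least squares factor using Givens rotations. It is a minimal sketch under stated assumptions, not the paper's algorithm: the function name `givens_add_row` and the NumPy representation are introduced here, only the row-addition step is shown (row deletion and the weighting are omitted), and a weighted fit is assumed to be handled by scaling each row by the square root of its weight beforehand.

```python
import numpy as np

def givens_add_row(R, row):
    """Return the upper-triangular factor after appending one data row.

    R is p x p upper triangular; row has length p. Each rotation zeroes
    one element of the new row against the matching diagonal of R.
    """
    R = R.astype(float).copy()
    w = np.asarray(row, dtype=float).copy()
    p = R.shape[0]
    for i in range(p):
        if w[i] == 0.0:
            continue                      # nothing to eliminate in this column
        h = np.hypot(R[i, i], w[i])
        c, s = R[i, i] / h, w[i] / h      # rotation cosine and sine
        Ri = R[i, i:].copy()
        R[i, i:] = c * Ri + s * w[i:]
        w[i:] = -s * Ri + c * w[i:]       # w[i] becomes zero
    return R

# Build the factor one observation at a time, as when a cut sweeps
# across the ordered data points of a node.
rng = np.random.default_rng(1)
A_demo = rng.normal(size=(8, 3))
R_demo = np.zeros((3, 3))
for data_row in A_demo:
    R_demo = givens_add_row(R_demo, data_row)
```

After all rows are added, the factor satisfies R'R = A'A, so the normal equations can be solved by back-substitution without refitting from scratch at every candidate cut.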


5.  DISCUSSIONS  AND  CONCLUSIONS 
This paper describes the basic methodology for using the orthogonalization approach to fit a wide class of tree-structured models.

The versatility of this orthogonalization approach is illustrated by the large class of parametric and nonparametric hierarchical models that can be fit within this framework. A model specification rule is developed to clarify the properties and uses of these procedures. It also helps explain differences among the tree-structured models in this class and their connections with the usual linear models.

In particular, procedures are specified by:

1) a split indicator (yes/no) for determining whether recursive partitioning is conducted or not;


3) an error structure (e.g. normal, binary, etc.);

4) types of predictor function (constant, linear); and

5) an objective function for choosing the optimal partition of each node (e.g. score test, mean square error of the estimates $\mathbf{t}$, etc.).

To enhance flexibility of the procedure, three options are included. The user can specify 1) force-to-enter variables, 2) prior weights, and 3) use of quantile splits. Prior weights can be used to obtain test sample error estimates. Topics which we do not have space to discuss here include practical problems of applying optimal stratifications to discrete response models, choices of objective function for choosing the "best" split of each node, design of effective output, descriptions of intermediate results, strategies for finding optimal-sized trees, etc. For the normal response models, some of these points have been studied in Siu (1985). They provide the basis for developing the tree-structured extensions of generalized linear models in this paper.

A  STOCHASTIC  EXTENSION  OF  PETRI  NET  GRAPH  THEORY
Lisa Anneberg, Wayne State University


INTRODUCTION 

A tool of rising importance in the area of computer software analysis is the Petri Net. Petri Nets were developed in 1962 by Carl Adam Petri in West Germany. Since then, many applications and methods of analysis have been proposed by Petri and other authors. Petri Nets are used in modeling of a system, and then for the system's subsequent analysis. Petri Nets have some distinct advantages over graphical or other modeling and analysis techniques, most particularly the ability to depict concurrency and parallelism. Also, Petri Nets allow modeling at different levels of abstraction, further extending their usefulness.

It is proposed to extend Petri nets to include elements of stochastic behavior and to utilize these Petri nets for practical examples. Some elements of net theory [1] may not be applicable, but the contribution of improved graphical representation and reachability tree analysis is considerable. The stochastic behavior postulated answers "with what probability will the nodes in this path function?", given the non-determinacy of firing rules, an essential element in Petri Net theory.

PETRI  NET  DEFINITIONS 

Petri Nets have the outstanding advantage of being able to show parallelism or concurrent systems, in addition to showing elements of control. This is an especially useful advantage when discussing computer hardware of LSI or greater complexity (as it is or should be highly parallel).

Petri Nets are bipartite graphs capable of modeling a wide variety of situations well. The two types of nodes are places (represented by circles) and transitions (represented by bars). The nodes are connected by (usually directed) arcs. Tokens are graph primitives that provide control. The tokens reside in the places and represent an item or condition (for example data or machines). The tokens move or flow when a transition 'fires'. A transition may fire when each input place contains at least one token. After firing, each output place of that transition will contain an additional token. Generally, places may contain more than one token. Conflict and non-determinacy are allowed, and can be advantageous in modeling real systems. The following diagrams illustrate this conflict and concurrency:


[Diagrams: CONFLICT and CONCURRENCY]


In the CONFLICT diagram, t1 and t2 will not be able to fire simultaneously. The token from p1 can enable them both, but only one transition may fire. In the CONCURRENCY diagram, t3 and t4 can fire at the same time (in parallel) or in some other specified synchronous manner.
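
The firing rule and the conflict situation can be sketched as a small token game in Python. Since the diagrams themselves are not reproduced here, the net below is a hypothetical stand-in: one place p1 feeding two transitions t1 and t2, with output places p2 and p3 chosen for illustration. The sketch also assumes the usual convention that firing removes one token from each input place, which the conflict description implies but the text does not state outright.

```python
def enabled(marking, inputs, t):
    """A transition may fire when every input place holds at least one token."""
    return all(marking[p] >= 1 for p in inputs[t])

def fire(marking, inputs, outputs, t):
    """Return the new marking after firing t (the old marking is not modified)."""
    if not enabled(marking, inputs, t):
        raise ValueError(f"transition {t} is not enabled")
    new_marking = dict(marking)
    for p in inputs[t]:
        new_marking[p] -= 1      # assumed: firing consumes one token per input place
    for p in outputs[t]:
        new_marking[p] += 1      # each output place gains one token
    return new_marking

# Hypothetical conflict net: a single token in p1 enables both t1 and t2.
conflict_inputs = {"t1": ["p1"], "t2": ["p1"]}
conflict_outputs = {"t1": ["p2"], "t2": ["p3"]}
m0 = {"p1": 1, "p2": 0, "p3": 0}

both_enabled = enabled(m0, conflict_inputs, "t1") and enabled(m0, conflict_inputs, "t2")
m1 = fire(m0, conflict_inputs, conflict_outputs, "t1")
t2_still_enabled = enabled(m1, conflict_inputs, "t2")
```

Once t1 has fired, the shared token is gone and t2 is no longer enabled, which is the conflict described above.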

Formally, Petri Nets are represented by a four-tuple PN:

PN = (P, T, I, O) where

P = (p1, p2, ..., pn), the set of n given places

T = (t1, t2, ..., tm), the set of m given transitions

I = (I(t1), I(t2), ..., I(tm)), the set of input places to each transition

O = (O(t1), O(t2), ..., O(tm)), the set of output places to each transition

The marking Mi is a set expressing the token number for every place at a time i. An example Petri Net is:


The marking is (1, 2, 0, 0, 0). P = (p1, p2, p3, p4, p5). T = (t1, t2, t3, t4, t5). I(t1) = p1, I(t2) = p1, I(t3) = p1, I(t4) = p2, I(t5) = p3, O(t1) = p2, O(t2) = p3, O(t3) = p4, O(t4) = p4, O(t5) = p5.

One popular method of Petri Net analysis is reachability analysis. Reachability analysis, first proposed by Murata [5], involves the creation of a p x t matrix. This matrix has n rows, where n is the number of places, and m columns, where m is the number of transitions. This matrix illustrates the 'connections' between places and transitions. A zero entry occurs where the place and transition are not connected. In short, the p x t matrix is the incidence matrix, showing where places and transitions are connected.


Reachability analysis can be utilized to determine system success paths for reliability evaluation [4]. The state equation is formulated [5]:

$$A^{T}\,\Sigma = \Delta M$$

where $A^{T}$ is the transpose of the incidence matrix (p x t), $\Delta M$ is the change in marking, and $\Sigma$ is the firing count vector (1 when the column is included in the path). An example of reachability analysis is presented in the next section.
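
For the five-place example net given earlier, the p x t matrix and the state equation can be written out as follows. This is a sketch under an assumed sign convention (an entry of -1 where a place is an input of a transition and +1 where it is an output), which is the common convention but is not spelled out in the text; the variable names are chosen here for illustration.

```python
import numpy as np

places = ["p1", "p2", "p3", "p4", "p5"]
transitions = ["t1", "t2", "t3", "t4", "t5"]
# Input and output place of each transition, as listed in the example net.
I = {"t1": "p1", "t2": "p1", "t3": "p1", "t4": "p2", "t5": "p3"}
O = {"t1": "p2", "t2": "p3", "t3": "p4", "t4": "p4", "t5": "p5"}

# p x t matrix: rows are places, columns are transitions.
A_T = np.zeros((len(places), len(transitions)), dtype=int)
for col, t in enumerate(transitions):
    A_T[places.index(I[t]), col] -= 1   # assumed: token leaves the input place
    A_T[places.index(O[t]), col] += 1   # token arrives in the output place

M0 = np.array([1, 2, 0, 0, 0])          # marking of the example
sigma = np.array([1, 0, 0, 0, 0])       # firing count vector: fire t1 once
delta_M = A_T @ sigma                   # change in marking
M1 = M0 + delta_M
```

Firing t1 once moves one token from p1 to p2, so the marking changes from (1, 2, 0, 0, 0) to (0, 3, 0, 0, 0).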


STOCHASTIC  BEHAVIOR 


Simply stated, a stochastic process is one developing in time and governed by probabilistic behavior of some type. A Petri Net can be a stochastic process if probabilities are associated with some relevant features. In the literature [1], probabilities have been associated with either transitions or places. It is proposed that both places and transitions can and should have associated probabilities. The inherent difficulties in the mathematics will not be greater if both node types are considered to behave probabilistically. This association will not change the determinacy of the firing of the transition, but will state the probability of firing once 'chosen'.

The given p x t matrix for analysis:

$$\begin{bmatrix}
P(p1)P(t1) & P(p1)P(t2) & \cdots & P(p1)P(tm) \\
P(p2)P(t1) & P(p2)P(t2) & \cdots & P(p2)P(tm) \\
P(p3)P(t1) & P(p3)P(t2) & \cdots & P(p3)P(tm) \\
\vdots & \vdots & & \vdots \\
P(pn)P(t1) & P(pn)P(t2) & \cdots & P(pn)P(tm)
\end{bmatrix}$$

where P(pi) is the probability of success associated with place i and P(ti) is the probability of success associated with transition i. For example, if place m and transition n are connected, and their respective probabilities are .9 and .8, the correct probability of that path operation is 0.72.

However, if the path is obtained through this reachability analysis and the reachability matrix is utilized to determine the total system probability, the non-terminal nodes (those in the interior, that is, having both output and input nodes) will have their probabilities counted twice.

To illustrate, the path p1-t1-p2-t2 will have associated reachability probabilities calculated as:

P(p1)P(t1)P(t1)P(p2)P(p2)P(t2)

The interior nodes t1 and p2 will have their probabilities counted twice.

To counteract this effect, a small routine should be utilized in conjunction with the formulation of the matrix when calculating the path probabilities:


BEGIN

  Read i = 1, m : places
  For i = O(t_i) or I(t_i),
    Delete P(pm) for second incidence.

  Read j = 1, n : transitions
  For j = O(pi) or I(pi),
    Delete P(tn) for second incidence.

END

A routine such as this will counteract the effect of counting each interior node twice (it should be no more than a pair, since self-loops are normally not allowed in Petri Nets for ease of calculation). As previously stated, Petri Nets are advantageous in that much of the analysis is easily performed on a computer, and this routine clearly is.
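
The double counting and its correction can be checked numerically with a short Python sketch. The two helper functions and the probability values (0.9 for places, 0.8 for transitions) are assumptions introduced here for illustration; the corrected version simply counts every node on the path once, which is the effect the routine above is meant to achieve.

```python
from math import prod, isclose

def naive_path_probability(path, p):
    """Multiply the matrix entries P(a)P(b) over consecutive node pairs.

    Every interior node appears in two pairs, so it is counted twice.
    """
    return prod(p[a] * p[b] for a, b in zip(path, path[1:]))

def path_probability(path, p):
    """Count each node on the path exactly once."""
    return prod(p[node] for node in dict.fromkeys(path))

# Assumed success probabilities for the path p1-t1-p2-t2.
prob = {"p1": 0.9, "t1": 0.8, "p2": 0.9, "t2": 0.8}
path = ["p1", "t1", "p2", "t2"]

naive = naive_path_probability(path, prob)       # P(p1)P(t1)P(t1)P(p2)P(p2)P(t2)
corrected = path_probability(path, prob)         # P(p1)P(t1)P(p2)P(t2)
single_pair = path_probability(["p1", "t1"], prob)
```

For a single connected place and transition the two calculations agree (0.9 x 0.8 = 0.72); they differ only once a path has interior nodes.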

An  example  of  this  routine  is  given 
utilizing  Petri  Nets: 


This methodology technique, after $A^{T}$ and $\Delta M$ calculation, is column comparison. The comparison is accomplished via column addition. The formulation of this problem is as follows:

SET L = column number (transitions number).
INITIAL L = 1,

BEGIN L = L + 1.

  Add all combinations of L columns,

  Compare to $\Delta M$.

  If the sum equals $\Delta M$, success path exists.

  If pn = pm (self-loop), disallow success path.
  RETURN to BEGIN.

Once the entire set is compared (and this does require complete enumeration), the exhaustive set of paths through the Petri Net will have been obtained.
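
The column comparison can be sketched in Python as an exhaustive search over column subsets. The sketch below is an illustration under assumptions rather than the routine itself: it reuses the five-place example net from the definitions section (not the larger net whose success paths are listed next), assumes the -1/+1 sign convention for the p x t matrix, asks for the paths that move one token from p1 to p4, and omits the self-loop check. Matching the column sum to the change in marking identifies candidate transition sets; it does not by itself verify the firing order.

```python
from itertools import combinations
import numpy as np

def success_paths(A_T, delta_M):
    """Return every set of columns (transitions) whose sum equals delta_M."""
    n_transitions = A_T.shape[1]
    found = []
    for L in range(1, n_transitions + 1):            # number of columns added
        for cols in combinations(range(n_transitions), L):
            if np.array_equal(A_T[:, list(cols)].sum(axis=1), delta_M):
                found.append(cols)
    return found

# p x t matrix of the five-place example (rows p1..p5, columns t1..t5).
A_T_example = np.array([
    [-1, -1, -1,  0,  0],   # p1 is the input of t1, t2, t3
    [ 1,  0,  0, -1,  0],   # p2: output of t1, input of t4
    [ 0,  1,  0,  0, -1],   # p3: output of t2, input of t5
    [ 0,  0,  1,  1,  0],   # p4: output of t3 and t4
    [ 0,  0,  0,  0,  1],   # p5: output of t5
])
target = np.array([-1, 0, 0, 1, 0])                  # one token from p1 to p4
paths_found = success_paths(A_T_example, target)
```

Two transition sets are found, t3 alone and t1 together with t4, corresponding to the paths p1-t3-p4 and p1-t1-p2-t4-p4. The search visits every subset of columns, which is why the text notes that complete enumeration is required.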

The success paths obtained for this Petri Net example are:

p1-t1-p2-t5-p4-t6-p6

p1-t2-p3-t4-p4-t6-p6

p1-t2-p3-t3-p5-t7-p6

p1-t2-p3-t4-p4-t8-p5-t7-p6

p1-t2-p3-t3-p5-t8-p4-t6-p6

In addition to calculating probabilities associated with correct function of these paths (assuming given probabilities for each specific node), calculation of the time required (given some associated time value for each specific node) can also be made. To calculate the times necessary, the times for each node will be added (the same allowance for interior nodes must be made).

CONCLUSION 

A methodology to calculate stochastic probabilities in a self-loop free Petri Net is presented. It is based on the reachability matrix and analysis. This method will calculate success path probabilities for Petri Net success paths. A further extension will allow time calculation for the success paths of the Petri Net to be made.

A promising area for further research would be to develop a method or heuristic to reduce the total combinations reviewed. Further study of the Markov behavior of given Petri Nets should be made.

A major difficulty [7] is that the more elaborate the Petri Net (the better it models the 'real world'), the more complex the calculations become. Associating time, cost, or probabilities with the Petri Net may make calculations complex or theoretical formulations incomplete.

TIMED  NEURAL  PETRI  NET

ABSTRACT

The concept of a timed neural Petri Net is presented, which is to be isomorphic to neural net architecture. The standard technique of neural net architecture is applied to this net. The basic Petri Net concepts have been extended. Some examples are presented.

INTRODUCTION 

The Petri Net concept has proven to be a very powerful tool in modeling parallel, sequential and concurrent processing. Even though the standard Petri Net has a wide range of capabilities for modeling systems, many authors found it somewhat restrictive for general systems. Several extensions were proposed. In this paper we propose another extension which is aimed at expanding the modeling capabilities of the Petri Net and reducing the complexity in parallelism. Most of what has been written about the similarities and differences between brain and machine has explored a new era in computer designs including massive parallelism and functional modularity.

In the first section, the basic definitions of the human brain and neuron are presented. Section II deals with the basic definitions of the Petri Net. In section III a review of some extensions to the standard Petri Net is presented, while in section IV a new mathematical formulation is defined. In V a new reachability concept is proven. In VI the (TNPN) is proposed. In VII an example is given, and finally some conclusions are drawn in section VIII.

I.1 Neuron

A neuron is the basic functional unit in the brain. It is small in area and volume, but it is able to receive hundreds of inputs through the dendrites and process them in the cell body (soma). The result of the process is the neuron output. The output goes through the axon hillock, axon and axon terminals to other neurons. Figure 1 is a typical neuron.


FIG. 1  A typical neuron

A neuron consists of synapses, soma, dendrites, axon, axon hillock and axon terminals.


Soma: The soma or cell body is the neural process place. This place receives its input from other neurons through the dendrites and transmits the output through the axon.

Dendrites: Dendrites are small, thin branches around the cell body. These dendrites receive the input data from the synapses and carry them to the soma.

Synapses: Synapses are the connection between the dendrites of one neuron and the axon terminals of another neuron.

Axon hillock: The axon hillock works as a threshold. The output data cannot flow through the axon until its weight is equal to or exceeds the weight of the axon hillock.

Axon: There is one axon in each neuron. An axon is a thin, long channel. This channel terminates in a branch of terminals. These terminals are inputs to other neurons.

The neuron is a powerful cell. It is able to receive
thousands of inputs and transmit thousands of outputs
concurrently. The capability of this small cell and the
architecture within the neural net are good areas of
research. For many years, researchers have been working very
hard to make machines work in a way similar to the human
brain.

I.2 Human Brain

The human brain is one of the most complex structured nets
in the known universe. Over 100 billion individual neurons
are grouped together to create a system with many separate
areas (modules). Each area is able to process a specific
type of data [1]. The cells in each area are grouped together
to create a subsystem of hierarchical superposition,
permitting information to flow in a stratified manner, layer
by layer [1]. Each layer is defined as a level. Each level is
able to process data at a certain degree of complexity. The
lower layer has the higher capabilities in processing.
Cellular interconnections between and within layers are well
organized. The organization is comprised of parallel and
hierarchical architectures. Parallelism is between layers,
while the hierarchical architecture is within each layer.
Computers are much faster in speed, but they perform poorly
in tasks that emulate the natural information processing
that humans handle routinely. The amazing brain architecture
is the secret of the brain's capabilities. Researchers have
obtained very good results in integrated circuits. Thousands
of gates can be on a single chip less than an inch in
diameter. The new technology will reduce the complexity of
building a powerful intelligent machine.


II. Basic Definition of Petri Net

A Petri Net is a formal directed bipartite graph used for
the analysis of systems. The first kind of node, represented
by circles, is called a place. The second, usually
represented by bars, is called a transition. Arcs connect
transitions to places and vice versa. If an arc exists from
a place to a transition (from a transition to a place), then
the place is called an input (output) of the transition.
Therefore, the set of arcs is partitioned into input arcs
$A_I$ and output arcs $A_O$. See Figure 2.

Fig. 2a. A place is an input to a transition.
Fig. 2b. A place is an output of a transition.

Formally, a Petri Net structure is a three-tuple

$PN = (P, T, A)$    (1)

where $P = \{p_1, p_2, \ldots, p_n\}$ is the set of places,

$T = \{t_1, t_2, \ldots, t_m\}$ is the set of transitions,

and $A_I \subseteq (P \times T)$ is the set of transition input arcs,
while $A_O \subseteq (T \times P)$ is the set of transition output arcs,
and $A = A_I \cup A_O$.

The places may contain tokens, represented by dots. When
tokens are present, the Petri Net is called a marked Petri
Net. The execution of a marked Petri Net is governed by the
following rule. A transition is enabled when all of its
direct input places contain at least one token. An enabled
transition can fire, thus removing one token from each input
place and placing one token in each output place. After a
transition fires, a new distribution of tokens occurs, thus
producing a new marking.
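
The enabling and firing rule just described can be sketched in a few lines of Python. This is only an illustration under assumed conventions that are not part of the paper: a marking is a dictionary from place names to token counts, arcs are (source, target) pairs, and the names `is_enabled`, `fire`, `A_I`, `A_O`, and `M0` are hypothetical.

```python
def is_enabled(transition, marking, input_arcs):
    """A transition is enabled when every direct input place holds >= 1 token."""
    return all(marking[p] >= 1 for (p, t) in input_arcs if t == transition)


def fire(transition, marking, input_arcs, output_arcs):
    """Return the new marking produced by firing an enabled transition."""
    if not is_enabled(transition, marking, input_arcs):
        raise ValueError(f"transition {transition!r} is not enabled")
    new_marking = dict(marking)
    for (p, t) in input_arcs:       # remove one token from each input place
        if t == transition:
            new_marking[p] -= 1
    for (t, p) in output_arcs:      # place one token in each output place
        if t == transition:
            new_marking[p] += 1
    return new_marking


# Hypothetical net: places p1 and p2 feed transition t1, which feeds p3.
A_I = [("p1", "t1"), ("p2", "t1")]   # transition input arcs (place, transition)
A_O = [("t1", "p3")]                 # transition output arcs (transition, place)
M0 = {"p1": 1, "p2": 1, "p3": 0}     # initial marking

M1 = fire("t1", M0, A_I, A_O)
```

Returning a fresh dictionary rather than mutating the marking keeps the initial marking available, which is convenient when several firing sequences are explored from the same state.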
In a more formal way, we can say that the marked PN is
defined as

$PN = (P, T, A, M_0)$    (2)

where $P$, $T$, and $A$ are as in (1) and
$M_0 = \{m_{01}, m_{02}, \ldots, m_{0n}\}$, where $M_0$ denotes the
distribution of tokens in the places in the initial marking.

III. Some Extensions to Standard Petri Nets

Some extensions were introduced to the standard Petri Net to
increase the modeling power of the tool. The extensions
considered in this section are the timed Petri Net (TPN),
multiple arcs, inhibitor arcs, and stochastic Petri Nets
(SPN).

Time usually plays a fundamental role in the description of
the behavior of any system. The standard PN is able to
describe only the logical structure of systems and not their
time evolution. There are different ways to introduce time
into a standard PN model, which allows the description of
the dynamic behavior of systems and takes into account both
the state evolution and the duration of each action
performed by the system. A formal definition of a TPN is as
follows:

$TPN = (P, T, A, M_0, \Theta)$    (3)

where $P$, $T$, and $A$ are as in (1), $M_0$ is as in (2), and
$\Theta = \{\theta_1, \theta_2, \ldots, \theta_m\}$ is the set of delays
associated with the PN transitions.

Stochastic PN (SPN) models are obtained by associating a
firing rate with each transition of the PN. An SPN is
defined as

$SPN = (P, T, A, M_0, L)$    (4)

where $P$, $T$, $A$, and $M_0$ are as in (3) and
$L = \{l_1, l_2, \ldots, l_m\}$ is the set of the possibly
marking-dependent firing rates associated with the PN
transitions.

Many other extensions were introduced to the standard PN to
increase the modeling power of the tool. Some of these
extensions are the multiple arc, the inhibitor arc, and the
modeling of parbegin and parend in PN, which will be defined
by small examples.

In the multiple arc, as shown in Figure 3, more than one arc
is allowed to connect a place to a transition and a
transition to a place. A transition is enabled when each of
its direct input places contains at least a number of tokens
equal to the number of arcs between the place and the
transition. An enabled transition can fire, thus removing
from each input place a number of tokens equal to the number
of arcs between the place and the transition, and putting
into each output place a number of tokens equal to the
number of arcs between the transition and the place.

Fig. 3. A PN with multiple arcs and a compact representation
of multiple arcs
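
A minimal sketch of the multiple-arc rule follows. It assumes, purely for illustration, that parallel arcs are encoded by repeating a (source, target) pair in the arc list; the function name `fire_multi` and the example net are hypothetical.

```python
from collections import Counter


def fire_multi(transition, marking, input_arcs, output_arcs):
    """Fire a transition in a net with multiple arcs.

    Returns the new marking, or None when the transition is not enabled.
    """
    # Number of parallel arcs from each input place to the transition.
    need = Counter(p for (p, t) in input_arcs if t == transition)
    # Number of parallel arcs from the transition to each output place.
    give = Counter(p for (t, p) in output_arcs if t == transition)
    if any(marking.get(p, 0) < k for p, k in need.items()):
        return None
    new_marking = dict(marking)
    for p, k in need.items():
        new_marking[p] -= k
    for p, k in give.items():
        new_marking[p] = new_marking.get(p, 0) + k
    return new_marking


# Two arcs p1 -> t1, one arc p2 -> t1, three arcs t1 -> p3.
arcs_in = [("p1", "t1"), ("p1", "t1"), ("p2", "t1")]
arcs_out = [("t1", "p3"), ("t1", "p3"), ("t1", "p3")]

fired = fire_multi("t1", {"p1": 2, "p2": 1, "p3": 0}, arcs_in, arcs_out)
blocked = fire_multi("t1", {"p1": 1, "p2": 1, "p3": 0}, arcs_in, arcs_out)
```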

IV. Mathematical Formulation of the Conflict Equation

Consider the Petri Net in Figure 4, which we will decompose
into different levels. The first level is formed by all the
transmitting places,

$$P^1_1, P^1_2, \ldots, P^1_{n_1}.$$

Fig. 4. Partition into levels and classes

The information from any one of these places can flow to a
connected place in the following level through an
appropriate transition. The places of a typical level $i$ are
denoted by

$$P^i_1, P^i_2, \ldots, P^i_{n_i}.$$

This level can receive information from places of level
$i-1$ and transmit the information to places of level $i+1$.
The last level, labeled $L$, is formed by all the receiving
places. It is now obvious that the total number of places is

$$n = n_1 + n_2 + \cdots + n_L.$$

The second level will be divided into $n_1$ classes

$$C^1_1, C^1_2, \ldots, C^1_{n_1},$$

where $k \in C^1_j$ if $P^1_j$ is connected to $P^2_k$. In
general, the $i$th level, where $i = 2, \ldots, L$, is divided
into $n_{i-1}$ classes

$$C^{i-1}_1, C^{i-1}_2, \ldots, C^{i-1}_{n_{i-1}},$$

where $k \in C^{i-1}_j$ if $P^{i-1}_j$ is connected to $P^i_k$.
Note that these classes are not necessarily disjoint;
however,

$$\bigcup_{j=1}^{n_{i-1}} C^{i-1}_j = \{1, 2, \ldots, n_i\}.$$

With these preliminary remarks, the conflict equations
between level $i-1$ and level $i$ can be written as follows:

$$P^{i-1}_j \rightarrow P^i_k, \quad \text{where } k \in C^{i-1}_j. \qquad (5)$$

The general conflict equations between level $i-1$ and level
$i+1$ can be written as

$$P^{i-1}_j \rightarrow P^{i+1}_l, \quad \text{where } l \in C^i_k \text{ for some } k \in C^{i-1}_j. \qquad (6)$$

Equations (5) and (6) provide a step-by-step procedure from
which the conflict equation between any two levels can be
written.

Now different representations of the conflict equation can
be given. In the following, we will focus on a linear
algebra representation using boolean arithmetic. We
introduce a transition matrix $S^{(i-1,i)}$ with entries

$$s_{kj} = \begin{cases} 1 & \text{if } k \in C^{i-1}_j, \\ 0 & \text{otherwise.} \end{cases}$$

If the information at level $i-1$ is represented by a column
matrix $x^{(i-1)}$ with entries

$$x_j = \begin{cases} 1 & \text{if } P^{i-1}_j \text{ is loaded,} \\ 0 & \text{otherwise,} \end{cases}$$

then after the firing we have

$$x^{(i)} = S^{(i-1,i)} x^{(i-1)}.$$

Now the transition matrix between level $i$ and level $L$,
where $i < L$, is given by

$$S^{(i,L)} = S^{(L-1,L)} S^{(L-2,L-1)} \cdots S^{(i,i+1)}.$$

V. Reachability in the Weak Sense

Now we introduce a new concept, closely related to that of
reachability, which we shall call reachability in the weak
sense. Consider two places, $P^i_j$ and $P^L_k$, belonging to
levels $i$ and $L$, respectively. To know whether a token
originally in $P^i_j$ can reach $P^L_k$, we have to form the
transition matrix $S^{(i,L)}$ between levels $i$ and $L$. If
the entry $s_{kj}$ is equal to 1, then the token can be
transmitted to $P^L_k$; otherwise ($s_{kj} = 0$) it cannot.
Knowing the reachability in the weak sense for all places,
one can immediately have information about the reachability
for the net at large for the model we are considering.
VI. Timed Neural Petri Net (TNPN)

The structure, type, and behavior of Petri Nets almost
correspond to those of the fundamental neural elements. The
Timed Petri Net has been modified to a Timed Neural Petri
Net definition to increase the power of the tool and to
accommodate almost all the elements of the neural system.
The modifications are in the contents of the place and the
transition. Figure 6 is a typical TNPN cell.

Fig. 6. A typical TNPN cell

This cell has only one place and one transition. The place
has n inputs and only one output. The transition has m
outputs and only one input. The place contains colored
tokens. Each color has a different weight. The higher weight
has the higher priority in processing, and vice versa. The
output arc works as a threshold. Tokens are not allowed to
flow to the transition from the place until their weight
exceeds or equals the arc weight. The transition is able to
process and transmit the transferable token to many places.
The refinement of the transition is a subnet; see Figure 7.
This subnet consists of one input transition $t_i$ and one
output transition $t_o$. The input transition $t_i$ receives
only one copy at a time and transmits k copies to k places
concurrently. The output transition is an OR transition,
i.e., $t_o$ is enabled when any one of its input arcs is
active. Then $t_o$ is able to transmit copies to n places,
and so on.

The k places are inputs to k statements $S_1, S_2, \ldots, S_k$.
The number of statements in each TNPN transition is equal to
the number of colors which are used for the colored tokens.
Each statement is designed to process specific colored
tokens. For instance, when the k statements receive the k
messages, only one statement is going to process the
message. The rest of the copies in the (k-1) statements will
disappear after the life time.
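
The behavior of a single TNPN cell can be sketched as follows. This is a simplified illustration rather than the authors' formal definition: time and token life times are not modeled (unprocessed copies are simply dropped), a statement is assumed to be a callable keyed by color, and the names `Token`, `fire_cell`, and `statements` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    color: str
    weight: int


def fire_cell(place_tokens, arc_weight, statements):
    """Move the highest-weight eligible token through the cell.

    A token is eligible when its weight equals or exceeds the output-arc
    weight (the threshold). Every statement receives a copy, but only the
    statement matching the token's color processes it.
    Returns (token, results) or None when no token passes the threshold.
    """
    eligible = [tok for tok in place_tokens if tok.weight >= arc_weight]
    if not eligible:
        return None
    tok = max(eligible, key=lambda t: t.weight)   # higher weight, higher priority
    place_tokens.remove(tok)
    results = [stmt(tok) for color, stmt in statements.items()
               if color == tok.color]
    return tok, results


statements = {
    "red": lambda tok: f"red handled weight {tok.weight}",
    "green": lambda tok: f"green handled weight {tok.weight}",
    "blue": lambda tok: f"blue handled weight {tok.weight}",
}
place = [Token("red", 5), Token("blue", 2), Token("green", 8)]

first = fire_cell(place, 4, statements)
second = fire_cell(place, 4, statements)
third = fire_cell(place, 4, statements)
```

With a threshold of 4, the green token (weight 8) passes first, the red token (weight 5) second, and the blue token (weight 2) stays in the place.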

Finally, the similarities between the neuron and the TNPN
are as follows:

- The place P represents the cell body (soma). It is able to
receive many inputs and transmits the outputs through one
arc.

- The arc between the place and the transition in the TNPN
represents the axon, and it works as a threshold.

- The arcs from the output transition to the places
$P_1, P_2, \ldots, P_n$ are similar to the axon terminals.

- The colored tokens represent the chemicals in the neuron.

- The input arcs to the place P represent the axon terminals
from another neuron.



Fig. 7. A TNPN transition

li^ui  e  *'  IS  '•i.^Sh  ppit  I  \»»r  T^e  rpar  hahii  >'\  li  oin 

;''.V  -  !  .;■  -'liei  ;-’.'k'‘^  M  5'i.ir.:<.  ;s  h^rriuse  ut 

'  'ill!  i  1'  I  '  I  f'l  I  tf-S  Bi-if  M  ^  fr-  1 1- MhS •;-:!i  tli*^  S.ltlie 

;;;  "-•.il.  T\P\  a:!l  (;e!  :'.ei 

f'lgilir  s 

Fig. 8. Decomposition of a TNPN into levels and classes

The main difference between the nets in Figure 4 and Figure
8 is that the output of a place $P^{i-1}_j$ in Figure 4 is
able to flow to only one place in class $C^{i-1}_j$ in level
$i$, while the output of the place $P^{i-1}_j$ in Figure 8
flows to all places in class $C^{i-1}_j$. This property
protects the net from the conflict complexity and makes the
reachability in the weak sense a reachability in the strong
sense. So the net in Figure 4 is in conflict, while the net
in Figure 8 works in parallel.

VII. Example

Fhe  r\p\  iiioilel  in  Figure  a  is  based  on  a  neural 
S'l  nt  nil  e  This  model  is  used  lo  ronirol  the 
I  drill  artimi  ol  some  imiscles  In  iliis  Figure,  the  mam 
neurons  whirh  pi  ovule  mptils  lo  Ihe  musrie  filters 

i  ) 

are  lepresenied  by  places  P|and  P^.  |  in  level  Li. 

Places  in  [,l  are  represented  as  the  pain  receptor  LJ 
are  represented  as  fiber  neurons  whirh  iranler  ihe 
messages  Iroiii  L)  to  Li  Plares  in  I..'-  level  repiesenl 
the  human  hrain  The  places  'n  this  level  aie  able  lo 
process  ihe  input  data  from  L2  or  Li.  and  arrording 
to  process  results  Li  will  send  messages  lo  Lb 
thrniigti  1,1  l,b  woiks  merhanirallv  The  degree  ol 
ronti  aciiuii  ol  ihe  miisrie  is  proportional  to  the  color 
1  1 

ol  tokens  at  place  P,  and  P|. ,  m  l.i  Alter  Ihe 
s 

contraction.  P|j;  in  L'  will  send  a  message  bark  to 
.1  1 

P|  ^  in  L3  ihroogh  P|.^  in  Ld.  indicating  thai  ih« 

ronirai  iion  is  done  The  places  m  Li  are  able  to 
inieriiipi  the  contraction  in  Lb  by  sending  higher 
weight  tokens  to  Lb  through  Li 


Fig. 9. A TNPN for controlling some muscles

VIII. Conclusion

The modeling of a TNPN cell has been discussed through a
meaningful interpretation of various Petri Net structure
techniques. A mathematical interpretation has been given to
the conflict equation inherently modeled by PN. A new
concept has been introduced to reduce the conflict
complexity by sending a number of messages equal to the
number of outputs. This model can be considered as a basis
for translating brain features into the framework of a
computing system. Success in implementing this model would
lead to a revolution in computer architectures and
artificial intelligence systems.


ESTIMATING STANDARD ERRORS: EMPIRICAL BEHAVIOR
OF ASYMPTOTIC MSE-OPTIMAL BATCH SIZES


Wheyming  Tina  Song  and  Bruce  Schmeiser,  Purdue  University 


Abstract

When an estimator of the variance of the sample
mean is parameterized by batch size, one approach for
selecting batch size is to pursue the minimal mean
squared error. Recently asymptotic results have been
obtained for the mse-optimal batch size. Based on
Monte Carlo experiments we conclude that the
asymptotic formula is an accurate approximation
when used with finite-size samples from processes
having geometrically decreasing autocorrelations, even
when the ratio of sample size to the sum of
autocorrelations is as small as five. The study
considers three steady-state data processes, four
estimator types, and four sample sizes. Although we
don't discuss batch-size estimation procedures, the
formula is a foundation for estimating the optimal
batch size from data in practice.

1. Introduction

Estimating the variance of the sample mean is a
fundamental problem of statistics. In simulation
experiments and some other contexts, the data are
sometimes assumed to be from a covariance-stationary
process $X$ having unknown mean $\mu$, unknown positive
variance $R_0$, and unknown finite fourth moment.
Various types of estimators have been proposed,
including regenerative [2], ARMA time-series models
[3, 4, 12], spectral [8], standardized time series [7, 13],
nonoverlapping batch means [11], and overlapping
batch means [10]. The batch-means and some
standardized-time-series estimators operate on batches
of observations; therefore the statistical properties of
such estimators depend on the batch size, $m$, as well
as the process $X$ and the sample size $n$.

We study here the mean squared error (mse) of
nonoverlapping-batch-means (NBM), overlapping-batch-means
(OBM), standardized-time-series-area (STS.A), and the
nonoverlapping-batch-means combined with
standardized-time-series-area (NBM+STS.A) estimators. We
apply these four estimator types to three data processes,
each having geometrically decreasing correlation structure,
with Bernoulli, normal, and exponential marginal
distributions. Four sample sizes are considered.

Although we report mse's of various combinations
of estimator, data process, and sample size, the focus
here is on the accuracy of an asymptotic formula for
batch sizes that minimize mse. In Section 2 we state
the result in discrete time from Schmeiser and Song
[14]. The result originally appeared in Goldsman and
Meketon [6], but in continuous time and without the
explicit constant $\gamma_1$.

2. Summary from Schmeiser and Song

We summarize here the asymptotic results in
Schmeiser and Song [14] for comparison to the
empirical results in Section 3.

For $h = 0, 1, 2, \ldots$, let $\rho_h$ be the lag-$h$
correlation $\mathrm{corr}(X_i, X_{i+h})$. Define the constants

$$\gamma_0 = \sum_{h=-\infty}^{\infty} \rho_h = 1 + 2 \sum_{h=1}^{\infty} \rho_h$$

and

$$\gamma_1 = \sum_{h=-\infty}^{\infty} |h| \, \rho_h = 2 \sum_{h=1}^{\infty} h \, \rho_h .$$

For example, independent and identically distributed
(iid) processes correspond to $\gamma_0 = 1$ and $\gamma_1 = 0$. These
constants play a central role in determining
asymptotic mse, as shown in Proposition 1.

Let $m$ denote the batch size, $\hat V(m)$ denote the
estimator of the variance of the sample mean, and
$\mathrm{bias}(\hat V(m)) = E(\hat V(m)) - \mathrm{var}(\bar X)$ denote the bias.
For NBM, OBM, STS.A, and NBM+STS.A
estimators, Proposition 1 holds.


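Proposition 1. If both $\gamma_1$ and the fourth
moment exist and are finite, then there are
constants $c_b$ and $c_v$ such that

$$\lim_{m \to \infty} \; \lim_{n/m \to \infty} n \, m \; \mathrm{bias}(\hat V(m)) = -c_b \, \gamma_1 R_0 , \qquad (2.1)$$

$$\lim_{m \to \infty} \; \lim_{n/m \to \infty} \frac{n^3}{m} \, \mathrm{var}(\hat V(m)) = c_v \, (\gamma_0 R_0)^2 , \qquad (2.2)$$

and

$$\mathrm{mse}(\hat V(m)) \approx \left[ \left( \frac{c_b \gamma_1}{m n} \right)^2 + \frac{c_v \, m \, \gamma_0^2}{n^3} \right] R_0^2 . \qquad (2.3)$$

The optimal batch size, $m^*$, satisfies

$$\lim_{n \to \infty} n^{-1/3} m^* = \left[ 2 \left( \frac{c_b^2}{c_v} \right) \left( \frac{\gamma_1}{\gamma_0} \right)^2 \right]^{1/3} \qquad (2.4)$$

and the optimal mse satisfies

$$\lim_{n \to \infty} n^{8/3} \, \mathrm{mse}(\hat V(m^*)) = 3 \left[ \frac{c_b \, c_v \, \gamma_1 \gamma_0^2}{2} \right]^{2/3} R_0^2 . \qquad (2.5)$$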
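
As a numerical sanity check on how the limits in Proposition 1 lead to an optimal batch size, the sketch below treats the bias and variance limits (2.1) and (2.2) as finite-sample approximations, writes the two constants of the proposition as `c_b` (bias) and `c_v` (variance), and compares a brute-force search over integer batch sizes with the closed-form minimizer obtained by setting the derivative of the approximate mse to zero. The parameter values are illustrative assumptions, not constants taken from the paper.

```python
import math


def asymptotic_mse(m, n, c_b, c_v, gamma0, gamma1, R0):
    """Squared asymptotic bias plus asymptotic variance at batch size m."""
    bias = -c_b * gamma1 * R0 / (n * m)
    var = c_v * m * (gamma0 * R0) ** 2 / n ** 3
    return bias ** 2 + var


def optimal_batch_size(n, c_b, c_v, gamma0, gamma1):
    """Minimizer of asymptotic_mse over m (zero of its derivative)."""
    return (2.0 * n * (c_b ** 2 / c_v) * (gamma1 / gamma0) ** 2) ** (1.0 / 3.0)


def optimal_mse(n, c_b, c_v, gamma0, gamma1, R0):
    """Value of asymptotic_mse at the minimizing batch size."""
    core = (c_b * c_v * gamma1 * gamma0 ** 2 / 2.0) ** (2.0 / 3.0)
    return 3.0 * core * R0 ** 2 / n ** (8.0 / 3.0)


# Illustrative values only (assumed for this example).
params = dict(n=1000, c_b=1.0, c_v=2.0, gamma0=10.0, gamma1=49.5)
R0 = 1.0

m_star = optimal_batch_size(**params)
best_integer_m = min(range(1, 201),
                     key=lambda m: asymptotic_mse(m, R0=R0, **params))
mse_at_m_star = asymptotic_mse(m_star, R0=R0, **params)
closed_form_mse = optimal_mse(R0=R0, **params)
```

The integer search should land next to `m_star`, and the mse evaluated at `m_star` should agree with the closed-form optimal value up to rounding error.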


In terms of the correlation structure, the
asymptotic bias is a function of only $\gamma_1$ and the
asymptotic variance is a function of only $\gamma_0$. The
asymptotic mse, optimal batch size, and optimal mse
are all functions of both $\gamma_0$ and $\gamma_1$.

Goldsman and Meketon provide the constants $c_b$
and $c_v$, so that we can compare NBM, OBM, STS.A,
and NBM+STS.A estimators in terms of their
asymptotic mse's. Table 1, from Schmeiser and Song
[14], reviews and extends their results.

Table 1: Comparison of NBM, OBM, STS.A, and NBM+STS.A Estimators
[Table entries not recoverable. The rows include the optimal
batch size constant (third row), the optimal mse constant
(next-to-last row), and the second-derivative constant (last
row).]

The batch means methods have relatively little
asymptotic bias; NBM and STS.A have relatively large
asymptotic variance.

The optimal batch size constants are shown in
the third row. The batch means estimators require
batches about half the size of STS.A estimators.

The optimal mse constants are shown in the
next-to-last row. OBM is smallest, with NBM and
NBM+STS.A a little larger, and STS.A about double.

The last row contains a measure of the
mse-robustness to batch size. Since a practitioner needs to
estimate the optimal batch size, and since the
statistical estimate will not always be correct, an
appealing property of an estimator is that it not be
sensitive to batch size in the region of $m^*$. We use
the second derivative of the mse with respect to batch
size as a measure of estimator robustness. This
derivative, based on equation (2.3) and evaluated at
any $m^*$ satisfying equation (2.4), is

$$\left. \frac{\partial^2 \mathrm{mse}}{\partial m^2} \right|_{m = m^*} = 3 \cdot 2^{-1/3} \, \frac{c_v^{4/3}}{c_b^{2/3}} \, \frac{\gamma_0^{8/3}}{\gamma_1^{2/3}} \, R_0^2 \, n^{-10/3} .$$

The estimator-dependent constant is shown in the last row of
Table 1. NBM+STS.A is the most robust and NBM is
the least robust of the four estimator types.

3. Monte Carlo Results

In this section, we report some Monte Carlo
experiments for finite sample sizes and compare the
finite-sample results to the asymptotic results of
Section 2. In Section 3.1 we discuss the three
processes used in the experiments described in Sections
3.2 and 3.3. These three processes have identical
correlation structure $\rho_h = \rho^h$, but different marginal
distributions.

In particular, we compare the (finite-sample)
optimal batch sizes $m^*$ to the asymptotic optimal
batch size

$$\tilde m^* = 1 + \left[ 2 n \left( \frac{c_b^2}{c_v} \right) \left( \frac{\gamma_1}{\gamma_0} \right)^2 \right]^{1/3} , \qquad (3.1)$$

which is motivated by equation (2.4) and the fact that
$m^* = 1$ for iid data.

We also make three comparisons among mse's:
the (finite-sample) optimal mse, denoted by mse($m^*$);
the (finite-sample) mse with asymptotic batch size $\tilde m^*$,
denoted by mse($\tilde m^*$); and the asymptotic mse
evaluated at the asymptotic optimal batch size $\tilde m^*$,
denoted by $\widetilde{\mathrm{mse}}(\tilde m^*)$ and defined by

$$\widetilde{\mathrm{mse}}(\tilde m^*) = 3 \, n^{-8/3} \left[ \frac{c_b \, c_v \, \gamma_1 \gamma_0^2}{2} \right]^{2/3} R_0^2 .$$


3.1  The  Three  Processes 

The Monte Carlo experiment described in Section 3.2
is designed to investigate the effect of sample size
and marginal distribution on the accuracy of $\tilde m^*$ as an
approximation for $m^*$ when used with finite sample
sizes. The marginal distributions are exponential,
normal, and symmetric Bernoulli.

All three processes have the correlation structure
$\rho_h = \rho^h$. For such a correlation structure,
$\gamma_1 = (\gamma_0^2 - 1)/2$, so the optimal batch size for the four
types of estimators we consider is a function of $\gamma_0$ and
$n$ only. Therefore, the conclusions may not be true
for other correlation structures.

In the experiments, we want to specify the mean,
$\mu$ (irrelevant); the variance of the sample mean, $\mathrm{var}(\bar X)$;
the sum of correlations, $\gamma_0$; and the sample size, $n$. So for
each of the three processes, we give the structural
definition and formulas for calculating parameter
values from $n$, $\mathrm{var}(\bar X)$, and $\gamma_0$. For simplicity, we use
$\rho = \rho_1$. For all three processes, $\rho = (\gamma_0 - 1)/(\gamma_0 + 1)$
and

$$R_0 = \frac{n \, \mathrm{var}(\bar X)}{\dfrac{1+\rho}{1-\rho} - \dfrac{2\rho(1-\rho^n)}{n(1-\rho)^2}} .$$


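EAR(1) process [9]:

$$X_t = \rho X_{t-1} + \begin{cases} 0 & \text{w.p. } \rho , \\ \epsilon_t & \text{w.p. } 1-\rho , \end{cases}$$

where $\epsilon_t$ is iid exponential with rate $\lambda = R_0^{-1/2}$.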


AR(1) process [5]:

$$X_t = \mu + \rho (X_{t-1} - \mu) + \epsilon_t ,$$

where $\epsilon_t$ is iid normal with mean zero and variance
$(1 - \rho^2) R_0$.

The Symmetric Two-State Markov Chain (S2MC) [1]:

Let $\{X_t\}$ be a two-state dependent symmetric
Bernoulli process with state space $\{c, d\}$ and transition
matrix

$$\begin{bmatrix} (1+\rho)/2 & (1-\rho)/2 \\ (1-\rho)/2 & (1+\rho)/2 \end{bmatrix} ,$$

where $c = \mu - R_0^{1/2}$ and $d = \mu + R_0^{1/2}$.

The S2MC result and the relationships among $\gamma_0$,
$R_0$, and $\mathrm{var}(\bar X)$ for these processes are derived in
Appendix A.
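
To make the parameterization concrete, the sketch below computes the lag-one correlation and the marginal variance from the sample size, the target variance of the sample mean, and the sum of correlations, and then generates one AR(1) sample path. The marginal-variance expression used is the standard one for the variance of the mean of n observations with lag-h correlation equal to the h-th power of the lag-one correlation. The function names, the steady-state start, and the seed are assumptions made for the example.

```python
import math
import random


def process_parameters(n, var_mean, gamma0):
    """Return (rho, R0) for a process with geometric correlation rho**h."""
    rho = (gamma0 - 1.0) / (gamma0 + 1.0)
    scale = ((1.0 + rho) / (1.0 - rho)
             - 2.0 * rho * (1.0 - rho ** n) / (n * (1.0 - rho) ** 2))
    R0 = n * var_mean / scale
    return rho, R0


def simulate_ar1(n, mu, rho, R0, rng):
    """Generate n observations X_t = mu + rho (X_{t-1} - mu) + eps_t."""
    sigma_eps = math.sqrt((1.0 - rho ** 2) * R0)
    x = rng.gauss(mu, math.sqrt(R0))      # start in steady state (assumption)
    path = []
    for _ in range(n):
        x = mu + rho * (x - mu) + rng.gauss(0.0, sigma_eps)
        path.append(x)
    return path


rho, R0 = process_parameters(n=500, var_mean=1.0, gamma0=10.0)
path = simulate_ar1(500, 0.0, rho, R0, random.Random(12345))
```

With a sum of correlations of 10, the lag-one correlation is 9/11, which rounds to the value 0.8182 used in the experiments; for a sample of 500 and unit variance of the sample mean, the marginal variance comes out near 50.5.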


3.2  Experiment  1:  Monte  Carlo  Results  for  the  Three 
Processes 


This section contains the optimal mse and
optimal batch-size results of a Monte Carlo
experiment using the three processes of the last section
with lag-one correlation $\rho = 0.8182$. The estimator
type is OBM. Sample sizes are 50, 500, 1000, and
5000. In all cases, $\mathrm{var}(\bar X) = 1$; therefore the variance
of the observations, $R_0$, is a function of $n$. The mean is
(arbitrarily) $\mu = 0$.

The results are based on 10000 independent
observations at each design point. The mse's have
standard errors smaller than .004. The optimal
batch-size results reported are correct to within about
one unit of the least-significant digit.

Table 2 shows a comparison of finite-sample
optimal batch sizes and the asymptotic optimal batch
size $\tilde m^*$. The rows correspond to sample sizes and the
columns correspond to process types. The right-most
column shows the asymptotic batch size $\tilde m^*$ as a
function of $n$. Other entries are estimated optimal
batch sizes based on the Monte Carlo experiment.


Table 2: A Comparison of Finite-Sample
and Asymptotic Optimal Batch Size
Estimator Type: OBM
$\rho_h = (0.8182)^h$

    n      EAR(1)    AR(1)     S2MC     $\tilde m^*$
    50       7        10        12        13
   500      22        25        27        27
  1000     29-30     31-32     34-35      34
  5000     56-57      57       57-61      58


More  than  one  batch  size  is  shown  in  the  last  two 
rows  (n  =  1000  and  5000)  because  the  mse  function 
becomes  flatter  with  increasing  n,  making  Monte 
Carlo  identification  of  the  single  best  batch  size 
difficult. 
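
The right-most column of Table 2 can be reproduced from the asymptotic batch-size formula: one plus the cube root of 2n times (bias constant squared over variance constant) times the squared ratio of the two correlation constants. The sketch below does this for OBM. The constant values `C_B_OBM = 1` and `C_V_OBM = 4/3` are an assumption here, since the entries of Table 1 are not legible in this copy; they are the values usually quoted for OBM, and with them the rounded results match the tabulated column.

```python
def asymptotic_batch_size(n, rho, c_b, c_v):
    """Asymptotic optimal batch size for correlation structure rho**h."""
    gamma0 = (1.0 + rho) / (1.0 - rho)        # sum of all correlations
    gamma1 = 2.0 * rho / (1.0 - rho) ** 2     # sum of |h| * rho**|h|
    ratio = (gamma1 / gamma0) ** 2
    return 1.0 + (2.0 * n * (c_b ** 2 / c_v) * ratio) ** (1.0 / 3.0)


# Assumed OBM constants (not legible in this copy of Table 1).
C_B_OBM = 1.0
C_V_OBM = 4.0 / 3.0

batch_sizes = {n: round(asymptotic_batch_size(n, 0.8182, C_B_OBM, C_V_OBM))
               for n in (50, 500, 1000, 5000)}
```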

The asymptotic optimal batch sizes for all three
processes should have the same value because of the
common correlation structure. Indeed, when
$n = 5000$ the Monte Carlo estimates are all essentially
the same and equal to $\tilde m^*$.

S2MC converges to $\tilde m^*$ quickest, then AR(1), and
finally EAR(1). Since EAR(1), AR(1), and S2MC have
different marginal distributions (exponential, normal,
and symmetric Bernoulli), the values of the kurtosis
(9, 3, and 1) also differ. We know that for finite
sample sizes the optimal batch size depends upon the
kurtosis; based on these Monte Carlo results we
conjecture that the larger the kurtosis, the slower the
convergence.

A suboptimal batch size, even if distant from $m^*$,
can be quite satisfactory if the associated mse is close
to mse($m^*$). Therefore, we now compare the
differences between the optimal mse and the mse
associated with the (possibly) suboptimal batch size
$\tilde m^*$.

Figure 1 shows the (estimated) mse's for the three
processes for sample sizes $n = 50$ and $n = 500$. In
Figure 1 (and in Table 2) the largest difference
between the optimal mse, mse($m^*$), and the mse at
the approximated optimal batch size, mse($\tilde m^*$), occurs
for $n = 50$ for the EAR(1) process, where the mse at
$\tilde m^* = 13$ is about 20% larger than the optimal mse at
$m^* = 7$. In the other cases shown, the difference
between mse($m^*$) and mse($\tilde m^*$) is negligible.

So, at least for this correlation structure, the
asymptotic batch size formula (3.1) accurately
indicates a batch size having near-optimal mse for
sample sizes as small as $n = 50$.

3.3 Experiment 2: Monte Carlo Results for Four
Estimators

In this section, we investigate the accuracy of the
asymptotic optimal batch size, $\tilde m^*$, in estimating the
optimal batch size for four types of estimators: NBM,
OBM, STS.A, and NBM+STS.A. The data process is
AR(1) with $\rho = 0.8182$. The sample size is $n = 500$.
Common random numbers are used across all
estimator types. The mse's have standard errors
smaller than .004 and the optimal batch sizes, $m^*$, are
correct to within about one unit of the least-significant
digit.
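
For readers who want to see what two of the four estimators compute, here is a minimal sketch of the NBM and OBM estimators of the variance of the sample mean. The formulas are the commonly used textbook forms, which are not spelled out in this paper, so they should be read as an assumption; the function names are hypothetical, and NBM is written to discard any trailing observations that do not fill a complete batch.

```python
def nbm_estimate(x, m):
    """Nonoverlapping batch means estimate of var(sample mean)."""
    k = len(x) // m                         # number of complete batches
    used = x[:k * m]
    grand_mean = sum(used) / (k * m)
    batch_means = [sum(used[j * m:(j + 1) * m]) / m for j in range(k)]
    return sum((b - grand_mean) ** 2 for b in batch_means) / (k * (k - 1))


def obm_estimate(x, m):
    """Overlapping batch means estimate of var(sample mean)."""
    n = len(x)
    grand_mean = sum(x) / n
    window = sum(x[:m])                     # running sum of the current batch
    total = (window / m - grand_mean) ** 2
    for j in range(m, n):
        window += x[j] - x[j - m]           # slide the batch by one observation
        total += (window / m - grand_mean) ** 2
    return m * total / ((n - m + 1) * (n - m))


data = [1.0, 2.0, 3.0, 4.0]
nbm_value = nbm_estimate(data, 2)   # batch means 1.5 and 3.5
obm_value = obm_estimate(data, 2)   # batch means 1.5, 2.5, and 3.5
```

The sliding-window update keeps the OBM computation linear in the sample size, which matters when the estimate is repeated over many batch sizes and replications, as in an mse study.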


OBM  Estimators,  n=50 
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batch  size  m 
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Figure  2  shows  mse  as  a  function  of  batch  size  m 
for  NBM,  OBM,  STS. A,  and  NBM-fSTS.A  estimators 
for  n  =  500;  Figure  2(a)  shows  the  full  range  of  batch 
sizes  (the  feasible  ranges  of  batch  sizes  for  four 
estimators  are  listed  in  Appendix  B)  and. Figure  2(b) 
zooms  in  to  show  batch  sizes  in  the  reg'on  of  minimal 
mse. 
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Figure  1.  Mse  for  OBM  Estimators  applied  to  EAR(l), 
AR(1),  and  S2MC  Processes:  (a)  sample  size  n=50, 
(b)  sample  size  n=500. 


Figure  2.  Msc’s  of  NBM,  OBM,  STS..\,  and 
M3M+STS.A  Estimators: 

(a)m  =  2,  3,  ■  ■  ,  499,  (b)m  =  2,  3,  •  •  ■  ,  100. 
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In  this  example,  for  every  batch  size  OBM 
dominates  both  NBM  and  STS.A;  also  NBM+STS.A 
dominates  STS.A.  These  results  are  consistent  with 
the  asymptotic  values  in  the  last  two  rows  of  Table  1. 

In  this  example,  the  order  of  the  estimators  in 
increasing  robustness  (second  derivative  at  m*)  are 
NBM+STS.A,  STS.A,  OBM,  and  NBM.  This  order  is 
consistent  with  the  last  row  of  Table  1.  Visually,  the 
values  of  the  finite-sample  second  derivatives  appear 
consistent  with  the  values  of  the  asymptotic  second 
derivatives. 

In  this  example,  the  order  of  the  estimators  in 
increasing  optimal  batch  size  m*  is  NBM,  OBM, 
NBM+STS.A,  and  STS.A.  This  ordering  is  consistent 
with  the  asymptotic  order  in  the  third-to-last  row  of 
Table 1. Moreover, the asymptotic optimal batch size
$\tilde{m}^*$ has essentially the same value
as m* for each of the four estimators, as shown in the
first two rows of Table 3.


Table 3: Optimal Batch Sizes and Optimal Mse's for Finite Samples
Compared to the Asymptotic Formulas. Sample Size n = 500.

Property                                        Estimator Type
                                          NBM     OBM     STS.A   NBM+STS.A
$\tilde{m}^*$                             24      27      49      47
m*                                        20      25      50      45
$\widetilde{\mathrm{mse}}(\tilde{m}^*)$   0.141   0.108   0.293   0.141
mse($\tilde{m}^*$)                        0.11    0.10    0.19    0.11
mse(m*)                                   0.11    0.10    0.19    0.11

In this example, the order of the estimators in
increasing values of mse(m*) is OBM, NBM,
NBM+STS.A, and STS.A. This ordering is consistent
with the asymptotic values $\widetilde{\mathrm{mse}}(\tilde{m}^*)$ shown in the
next-to-last row of Table 1. However, except for
OBM, the optimal mse, mse(m*), is significantly lower
(35% lower for STS.A) than the asymptotic optimal
mse, $\widetilde{\mathrm{mse}}(\tilde{m}^*)$, as shown in the last row and the
third-to-last row of Table 3.

In this example, the optimal batch size m* and
the asymptotic optimal batch size $\tilde{m}^*$ have essentially
the same value, so the associated mse's also have the
same values, as shown in the last two rows of Table 3.
To what extent would we have cared if the batch sizes
had not matched? As discussed in the last paragraph
of Section 3.2, all that is important is whether $\tilde{m}^*$ can
indicate a batch size having near-optimal mse for
finite sample sizes. Therefore, the important
comparison is between the last two rows, mse(m*)
compared to mse($\tilde{m}^*$), rather than between the last
row and the third-to-last row, mse(m*) compared
to $\widetilde{\mathrm{mse}}(\tilde{m}^*)$.


4.  Summary 

In this paper, we study the accuracy of the
asymptotic optimal batch size as an approximation for
finite-sample cases. We consider four types of
estimators of var(X) that are parameterized by batch
size. We consider three data processes, all of which have
geometrically decreasing correlation structure but
different marginal distributions.

In the examples, the optimal batch size m* and
the asymptotic batch size $\tilde{m}^*$ yield essentially the
same mse even for sample sizes as small as n = 50 and
sum of correlations $\gamma_0$ = 10. That is, the asymptotic
optimal batch size formula (3.1) worked well in our
examples.

Appendix  A: 

We show here that for processes with
geometrically decreasing correlation structure, the
relationships discussed in Section 3.1 between $\rho$, $\gamma_0$,
and $R_0$ hold. We also show that the S2MC process
obtains the correct mean, var(X), and $\gamma_0$.

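Since

$$\gamma_0 \equiv 1 + 2\sum_{h=1}^{\infty}\rho_h = 1 + 2\sum_{h=1}^{\infty}\rho^h = 1 + 2\left(\frac{\rho}{1-\rho}\right),$$

we have $\rho = \dfrac{\gamma_0 - 1}{\gamma_0 + 1}$.

The steady-state variance of $\bar{X}$ is

$$\mathrm{var}(\bar{X}) = \frac{R_0}{n}\left[1 + 2\sum_{h=1}^{n-1}(1 - h/n)\rho_h\right] = \frac{R_0}{n}\left[1 + 2\sum_{h=1}^{n-1}(1 - h/n)\rho^h\right] = \frac{R_0}{n}\left[\frac{1+\rho}{1-\rho} - \frac{2\rho(1-\rho^n)}{n(1-\rho)^2}\right],$$

so $R_0 = n\,\mathrm{var}(\bar{X})\left[\dfrac{1+\rho}{1-\rho} - \dfrac{2\rho(1-\rho^n)}{n(1-\rho)^2}\right]^{-1}$.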


Now consider the S2MC process. Let $\{X_i\}$ be
a dependent symmetric Bernoulli process with
parameters $\mu$, var(X), and $\gamma_0$. Let the state space be
{c, d} and the transition matrix

$$P = \begin{pmatrix} (1+\rho)/2 & (1-\rho)/2 \\ (1-\rho)/2 & (1+\rho)/2 \end{pmatrix}.$$

We first show that $\rho_h = \rho^h$ and then show that
$c = \mu - R_0^{1/2}$ and $d = \mu + R_0^{1/2}$.

Since P is doubly Markov, at steady state c and d
are equally likely (i.e., $\Pr(X_i = c) = \Pr(X_i = d) = 1/2$).
Let $Z_i = (d - c)^{-1}(X_i - c)$. Then $Z_i$ has state space
{0,1} and transition matrix P. At steady state




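$$\begin{aligned}
\mathrm{cov}(Z_i, Z_{i+h}) &= E(Z_i Z_{i+h}) - E(Z_i)E(Z_{i+h}) \\
&= \left[E(Z_i Z_{i+h} \mid Z_i = 0)P(Z_i = 0) + E(Z_i Z_{i+h} \mid Z_i = 1)P(Z_i = 1)\right] - \left(\tfrac{1}{2}\right)^2 \\
&= \frac{1+\rho^h}{2}\cdot\frac{1}{2} - \frac{1}{4} = \frac{\rho^h}{4},
\end{aligned}$$

where

$$\frac{1+\rho^h}{2} = \Pr(Z_{i+h} = 1 \mid Z_i = 1)$$

by induction on h. Therefore,

$$\mathrm{corr}(Z_i, Z_{i+h}) = \frac{\mathrm{cov}(Z_i, Z_{i+h})}{\mathrm{var}(Z_i)} = \frac{\rho^h/4}{1/4} = \rho^h.$$

Since correlation is scale and location invariant,
$\rho_h = \mathrm{corr}(X_i, X_{i+h}) = \mathrm{corr}(Z_i, Z_{i+h}) = \rho^h$.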


Since $X_i = c + (d - c)Z_i$ and the two states are
equally probable, the variance is $R_0 = \dfrac{(d-c)^2}{4}$ and
the mean is $\mu = \dfrac{c+d}{2}$. (The variance is $(d - c)^2$
times the variance of an equally probable Bernoulli
trial.) Solving these two equations for c and d yields
the results.
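
The relationships in this appendix can be checked numerically. The following Python sketch is only an illustration under stated assumptions: the function names are hypothetical, and the closed forms are the ones reconstructed above. It recovers the correlation parameter from the sum of correlations, compares the direct and closed-form expressions for the variance of the sample mean, and confirms the S2MC h-step probability by repeated multiplication of the transition matrix.

```python
def rho_from_gamma0(gamma0):
    """Lag-one correlation implied by the sum of correlations gamma0."""
    return (gamma0 - 1.0) / (gamma0 + 1.0)


def var_mean_direct(r0, rho, n):
    """(R0/n) * [1 + 2 * sum_{h=1}^{n-1} (1 - h/n) * rho**h], summed term by term."""
    s = sum((1.0 - h / n) * rho ** h for h in range(1, n))
    return (r0 / n) * (1.0 + 2.0 * s)


def var_mean_closed(r0, rho, n):
    """Closed form of the same quantity for geometric correlations."""
    bracket = (1.0 + rho) / (1.0 - rho) - 2.0 * rho * (1.0 - rho ** n) / (n * (1.0 - rho) ** 2)
    return (r0 / n) * bracket


def s2mc_stay_probability(rho, h):
    """Pr(Z_{i+h} = 1 | Z_i = 1): the (1,1) entry of the h-th power of P."""
    p = [[(1 + rho) / 2, (1 - rho) / 2], [(1 - rho) / 2, (1 + rho) / 2]]
    power = [[1.0, 0.0], [0.0, 1.0]]  # identity matrix, i.e. P**0
    for _ in range(h):
        power = [[sum(power[i][k] * p[k][j] for k in range(2)) for j in range(2)]
                 for i in range(2)]
    return power[1][1]


rho = rho_from_gamma0(10.0)  # about 0.8182, the AR(1) value used in the example
print(rho, var_mean_direct(1.0, rho, 50), var_mean_closed(1.0, rho, 50))
print(s2mc_stay_probability(rho, 5), (1 + rho ** 5) / 2)
```

With the sum of correlations equal to 10, the first function returns 9/11, which matches the value 0.8182 quoted for the AR(1) example.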


Appendix  B:  Feasible  Regions  of  Batch  Sizes 

The feasible ranges of batch size m for NBM,
OBM, STS.A, and NBM+STS.A are listed below:

NBM: m = 1, 2, ..., ⌊n/2⌋,

OBM: m = 1, 2, ..., n-1,

STS.A: m = 2, 3, ..., n,

NBM+STS.A: m = 2, 3, ..., ⌊n/2⌋,

where ⌊a⌋ denotes the greatest integer less than or
equal to a.


SIMEST AND SIMDAT: DIFFERENCES AND CONVERGENCES


E.  Neely  Atkinson,  Barry  W.  Brown  and  James  R.  Thompson 
M.D.  Anderson  Cancer  Center  and  Rice  University 


Introduction.  Stochastic  process  modeling 
has,  until  fairly  recently,  exhibited  serious 
limitations  in  biomathematics,  econometrics 
and  other  areas  of  potential  application. 
Consequently,  investigators  have  frequently 
been  driven  to  linear  regression  and  other 
ad  hoc  techniques,  with  generally  poor 
results.  The  reason  for  the  difficulty  in 
applied  stochastic  process  modeling  is  that 
the  axioms,  since  the  time  of  Poisson,  have 
been  forward  in  time  direction,  whereas 
data analytical techniques, such as those
based on maximum likelihood, are
backwards in time direction. Let us
consider, for example, the following forward
axiomatization of cancer progression
considered  by  Bartoszynski,  Brown  and 
Thompson  (1982), 

(1) For each patient, each tumor originates
from a single cell and grows exponentially at
rate $\alpha$.

(2) The probability that a tumor of size $Y_j(t)$,
not previously detected and removed prior
to time t, is detectable in $[t, t + \Delta t]$ is
$b\,Y_j(t)\,\Delta t + o(\Delta t)$.

(3) Until the removal of the primary, the
probability of metastasis in $[t, t + \Delta t]$ is
$a\,Y_0(t)\,\Delta t + o(\Delta t)$, where $Y_0(t)$ is the
mass of the primary tumor.

(4) The probability of systemic occurrence of
a tumor in $[t, t + \Delta t]$ is $\lambda\,\Delta t + o(\Delta t)$
independent of the prior history of the
patient.

Written  as  they  are,  in  standard 
Poissonian  forward  form,  the  postulates  are 
extremely  simple.  However,  if  we  attempt 
to  use  one  of  the  backwards  "closed  form" 


approaches,  e.g.,  maximum  likelihood 
estimation,  we  are  quickly  bogged  down  in  a 
morass  of  confusion  and  complexity.  For 
example,  in  order  to  use  the  maximum 
likelihood  approach,  we  are  confronted  with 
the  necessity  of  computing  a  number  of 
messy  terms.  We  show  one  of  these  in  (5). 

X  cc 

!  5  )  PfTj  =  S.  T.,>Sj  =j  j  ':e  x 

0  u 

at-ui,  ,  a  , 

//.  +  ae  ]expl -  /Jt-  u) - (e  -  1)  1  x 

a 

HlvfS-  S);S':c'^)Hfv'S-  "‘‘'''idudl  + 

00  oo 

f  f  W’iS-Si  a  ,  at  -?.u 

lie  p(t:h  e.xp!~ /I — (e  -  li}Xe  x 

0  u  « 

p(S'-  u:li  H(v<S-  Si;S  '-u:l)dii  dt . 

where 

nr  al  s  ,  ,  -s  -at  .  , 

t6)  -expl  —e  (c  -Vlogll  +  e  le  -  D] 

a 

4  —  -  log!  I  +  e  ^le  -  1)11, 

n  a 

(  7  )  p(t;z)  =  h  z  c  ‘^c.xp  [  -  -  1)  ]  , 

a 


(8)  w(y)-Xl  J  c  ^“'du-yl. 
0 

and  v(u)  is  determined  from 
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(9) 


The order of computational complexity
here  is  roughly  that  of  four  dimensional 
quadrature.  This  is  near  the  practical  limit 
of  contemporary  mainframe  computers. 
The  time  required  for  the  estimation  of  the 
four  parameters  in  the  above  model  was 
roughly  two  hours  using  the  robust 
optimization  routine  STEPIT  on  the  CYBER 
173.  A  moment's  reflection  reveals  the 
problem.  If  a  tumor  is  detected  at  a 
particular  time,  we  must  examine  all 
possible paths which might have given rise
to  its  origin.  For  example,  it  might  be  a 
metastasis  from  the  primary,  or  a 
metastasis  from  a  metastasis  of  earlier 
origin,  or  a  systemically  generated  tumor,  or 
a  metastasis  from  a  systemically  generated 
tumor,  etc.  Each  of  those  paths  is  easy  to 
write  in  the  forward  direction,  but  when 
computing the likelihood, we reason
backwards. 

In 1983 and 1987, we presented the
algorithm  SIMEST  for  dealing  with  the 
backwards/forwards  dilemma.  In  this 
algorithm,  we  have  returned  to  the  older 
goodness  of  fit  philosophy  of  Karl  Pearson. 
Namely,  we  consider  that  a  set  of 
parameters  is  close  to  truth  when 
simulations  based  on  them  produce  results 
which  appear  to  be  sufficiently  similar  to 
those  of  the  data.  The  procedure  is  based  on 
binning  "failure  times"  from  the  data,  and 
noting  whether  the  simulated  failure  times 
fall  in  the  bins  in  a  manner  similar  to  those 
from  the  data.  For  example,  if  we  use  the 
formal  goodness  of  fit  criterion,  we  have 


( 10)  (9)  = 

where $p_{sj}$ is the proportion of simulations
falling in the j'th bin and $p_j$ is the
proportion of actual data points falling in the
j'th bin. A major problem with the
implementation of an algorithm based on
(10) is the fact that, as we proceed
from  point  to  point  in  the  parameter  space, 
the  criterion  will  exhibit  simulation  induced 
noise.  We  have  addressed  this  problem  in 
Atkinson,  Brown  and  Thompson  (1987)  by 
utilizing  a  fixed  seed  approach.  Thus,  if  we 
are  using,  say  1,000  simulations  at  each 
parameter  value,  we  use  the  same  sequence 
of seeds for each parameter value. Such an
approach  enables  us,  rather  than  developing 
stochastic  optimization  procedures,  simply 
to  use  existing  deterministic  software  (e.g., 
that  of  Nelder  and  Mead). 
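
The fixed seed idea can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' SIMEST code: the exponential failure-time model, the bin edges, the function names, and the Pearson-type form of the criterion are all assumptions made for the example (criterion (10) itself is not legible in this copy), and a plain grid search stands in for a deterministic optimizer such as Nelder and Mead's. Because every evaluation reuses the same seed, the criterion is a deterministic function of the parameter.

```python
import bisect
import math
import random


def bin_proportions(times, edges):
    """Proportion of the times falling in each bin defined by sorted edges."""
    counts = [0] * (len(edges) + 1)
    for t in times:
        counts[bisect.bisect_right(edges, t)] += 1
    return [c / len(times) for c in counts]


def simulate_failure_times(rate, n_sims, seed):
    """Simulate exponential failure times (assumed toy model) from a fixed seed."""
    rng = random.Random(seed)
    return [-math.log(1.0 - rng.random()) / rate for _ in range(n_sims)]


def criterion(rate, data_props, edges, n_sims=1000, seed=12345):
    """Assumed Pearson-type distance between simulated and data bin proportions."""
    sim_props = bin_proportions(simulate_failure_times(rate, n_sims, seed), edges)
    return sum((ps - pd) ** 2 / pd for ps, pd in zip(sim_props, data_props) if pd > 0)


def grid_search(data, edges, candidates):
    """Pick the candidate rate whose simulations best match the binned data."""
    data_props = bin_proportions(data, edges)
    return min(candidates, key=lambda r: criterion(r, data_props, edges))


data = simulate_failure_times(2.0, 500, seed=1)   # stand-in for real failure times
edges = [0.1, 0.25, 0.5, 1.0]
print(grid_search(data, edges, [0.5 + 0.05 * i for i in range(71)]))
```

Calling `criterion` twice with the same rate returns the identical value, which is what allows deterministic optimization software to be used.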

In  many  applications,  we  will  not  simply 
have  a  one  dimensional  response  variable 
(e.g.,  failure  time)  but  a  number  of  response 
variables.  Simple  Cartesian  binning  in  such 
a  situation  exhibits  numerous  difficulties, 
e.g.,  the  empty  space  phenomenon,  namely 
the  fact  that  most  of  the  bins  will  be  empty  of 
data  points.  The  major  purpose  of  this  paper 
is  to  address  alternatives  to  Cartesian 
binning  in  the  multivariable  response 
situation. 

Discussion. Let us suppose we have a
Poissonian model $M(\Theta)$ for which the
vector parameter $\Theta$ is unknown, but from
which we have a set of n observations of a k-
variate response variable X. We shall
assume a value of $\Theta$ and, using this
parameter, generate a set of N simulated
"data points" Y. If the assumed parameter
is, in fact, that which generated the actual
data, then we should find that the X cloud is
indistinguishable from the Y cloud. To
determine whether this is so, we might
determine the distance of each of the X
points from each of the Y points and from
each other. Such a matrix of distances
should provide all the information we need,
but would require (n + N - 1)n elements
for each simulation, of which some many
thousands will be required in order to find a
satisfactory value of $\Theta$. Accordingly, absent
the availability of highly parallel computer
architecture with hundreds of CPU's, we
need to seek more economical criteria.

We start out with a real world system
observable through k-dimensional
observations X. We believe that the
generating system can be approximately
described by a model characterized by the
parameter $\Theta$. If we have a data set from
the system of size n, we can, for a value of $\Theta$,
simulate a quasidata set of size N. Then, we
compute the sample mean vector and
covariance matrix of the X data set. Then,
we transform the data set to a transformed
set U = AX + b with mean zero and
identity covariance matrix I. We then apply the
transformation T to the simulated data set.
We compute the sample mean $\bar{X}$ and
covariance matrix $\Sigma$ of the simulation data
set. If the simulated set is essentially the
same as the actual data set, then it should
have for its transformed values mean zero
and identity covariance matrix I. If the
underlying data distribution is not too
bizarre, we can measure the fidelity of the
simulation data to the actual data by
computing the ratio of the Gaussian
likelihoods of the transformed simulated
data sets using the mean and covariance
estimates from the actual data and the
simulated data respectively. Defining


( 11) 


^ki' 


j.i-i 


jt-  J 


where $\sigma^{jl}$ is the j,l'th element of the inverse
of $\Sigma$, we have


fI2) 


.V 


i2n) 


We  note  that  (12)  involves  the 
computation  of  only  n  +  N  distances.  For  N 
large  and  the  underlying  distributions 
Gaussian,  the  procedure  approaches  that 
based  on  the  likelihood  ratio  test  and  enjoys 
its  optimality  properties.  When  the 
underlying  distributions  are  not  Gaussian, 
the  procedure  is  no  longer  optimal,  but  will 
frequently yield the correct value of $\Theta$ as
n → ∞. We note that it is by no means correct
to  assume  that  the  test  procedure  we  employ 
must  be  based  on  a  consistent 
nonparametric  density  estimator.  In  any 
event,  even  when  more  complex  algorithms 
are  required,  it  will  generally  be  useful  to  use 
(12) to move the starting value of $\Theta$ closer
to  truth. 

If the distributions of the response
variables are very different from Gaussian,
it may be appropriate to develop a
nonparametric procedure which can
become more and more complex until the
number of distances required per simulation
can approach that of (N + n - 1)n. For the
first step, we again carry out the
transformation T mentioned above which
transforms the data set to one with mean 0
and covariance matrix I. We record the
distances of each of the data points from 0,
say $\{d_j\}$, j = 1 to n, and those of the simulated
data points from 0, say $\{d_{sj}\}$, j = 1 to N. If we
have arrived at the true value of $\Theta$, then
when the two distance lists are put into one
of length n + N, we should expect to find an
equal distribution of simulated and actual
data values throughout the list. Letting W
denote the sum of the ranks in the total list
of the n data points, we know that if the true
value of $\Theta$ has been assumed for the
simulated data set, we have a standard
Wilcoxon-Mann-Whitney situation. Thus,
we let

(13) $U = nN + \dfrac{n(n+1)}{2} - W; \quad \sigma_U^2 = \dfrac{nN(n+N+1)}{12}; \quad \mu_U = \dfrac{nN}{2}; \quad Z = \dfrac{U - \mu_U}{\sigma_U}.$

If the value of $\Theta$ used in the simulation is
the same as that in the model, we know that
Z is normally distributed with mean 0 and
variance 1. This gives us a natural stopping
rule when changing $\Theta$ in the optimization
routine.
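
As an illustration of the rank statistic in (13), the following Python sketch computes Z from two lists of distances. It is a minimal sketch under stated assumptions: the function name is hypothetical, the distances are taken as already computed after the transformation T, and tied distances are given midranks, a convention the text does not specify.

```python
import math


def rank_sum_z(data_dists, sim_dists):
    """Standardized Wilcoxon-Mann-Whitney statistic for two distance lists."""
    n, big_n = len(data_dists), len(sim_dists)
    # Pool the distances, tagging actual data points with 0 and simulated with 1.
    pooled = sorted([(d, 0) for d in data_dists] + [(d, 1) for d in sim_dists])
    w = 0.0  # sum of the ranks of the n actual data points
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        midrank = (i + 1 + j) / 2.0  # average of the ranks i+1, ..., j
        w += midrank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    u = n * big_n + n * (n + 1) / 2.0 - w
    mu_u = n * big_n / 2.0
    sigma_u = math.sqrt(n * big_n * (n + big_n + 1) / 12.0)
    return (u - mu_u) / sigma_u


# Data distances all smaller than simulated ones: Z is far from zero.
print(rank_sum_z([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]))
# Perfectly interleaved lists: Z is zero.
print(rank_sum_z([1.0, 4.0], [2.0, 3.0]))
```

In an optimization loop, a value of Z well inside the central range of the standard normal would be the signal that the current parameter value is acceptable.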

Interestingly, although when a correct
value of $\Theta$ is assumed, Z must be N(0,1), the
fact that Z is N(0,1) does not necessarily
guarantee that we have arrived at the
correct value of $\Theta$. Note that we made our
decision based on relative distances of the
data and simulated values from the mean of
the data. It is easy to extend this relationship
to other data points. For example, we might
rank the transformed data distances from 0
and pick, as a second anchor point, the point
with median distance from 0. We then
compute the distances of the data points and
the simulated data points from this second
anchor point. Again, if the correct value of $\Theta$
has been picked, our new Z must be N(0,1).
Proceeding in this fashion, subsequent
anchor points would be those ranked in
distance from the data mean, n/4, 3n/4, n/8,
5n/8, 3n/8, 7n/8, n/16, ... .

One difficulty with the above approach is
that the test statistics for each anchor point
are not stochastically independent. Thus,
the significance levels cannot quite be
obtained by multiplication of tail area
probabilities from the standard normal
distribution. In fact, it would be irrelevant to
do so anyway, since it is possible to have
rank matching about several anchor points
without actually having picked $\Theta$ correctly.
Generally speaking, however, it will be very
unusual for rank tests about a few anchor
points all to pass satisfactorily the N(0,1)
test unless $\Theta$ has been correctly selected.

Finally,  let  us  consider  the  possibility  of 
employing  the  SIMDAT  algorithm  in 
conjunction  with  SIMEST  in  order  to  effect 
parameter  estimation. 

Let us suppose we have a random sample
$\{X_j\}$, j = 1 to n, of k-dimensional vectors.
SIMDAT  generates  pseudo  random  vectors 
from  the  underlying,  but  unknown 
distribution  that  gave  rise  to  the  random 
sample.  First  of  all,  we  carry  out  a  rough 
rescaling,  so  that  the  variability  in  each  of 
the  k  dimensions  is  approximately  equal. 
We  pick  an  integer  m  between  1  and  n  (the 
method  of  selecting  m  will  be  discussed 
shortly).  For  each  of  the  n  data  points,  we 
determine  the  m  -  1  nearest  neighbors 
using  the  ordinary  Euclidean  metric. 

To start SIMDAT, we randomly select one
of the n data points. We then have m
vectors, the data point selected and its m - 1
nearest neighbors. The vectors $\{X_j\}$, j = 1 to m,
are now coded about their sample mean

(14) $\bar{X} = \dfrac{1}{m}\sum_{i=1}^{m} X_i$,

to yield

(15) $\{X'_j\} = \{X_j - \bar{X}\}$, j = 1 to m.

Next,  we  generate  a  random  sample  of  size 
m  from  the  one  dimensional  uniform 
distribution 


This  particular  uniform  distribution  is 
selected  to  provide  the  desired  moment 
properties  below.  Now  the  linear 
combination 




(17) $X' = \sum_{l=1}^{m} u_l X'_l$

is formed, where $\{u_l\}$, l = 1 to m, is a random
sample from the uniform distribution in
(16). Finally, the translation

(18) $X = X' + \bar{X}$,

restores the relative magnitude, and X is a
simulated  vector  which  we  propose  to  be 
representative  of  the  multivariate 
distribution  that  generated  the  original  data 
set.  To  obtain  the  next  simulated  vector,  we 
randomly  select  another  point  from  the 
original  data  base  and  repeat  the  above 
sequence  (sampling  with  replacement). 
Although  it  is  very  easy 
and  quick  to  use,  SIMDAT  essentially  gives 
the  same  results  that  one  would  obtain  by 
laboriously  obtaining  the  nonparametric 
probability  density  estimator  and  sampling 
from  it. 
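
The SIMDAT steps can be sketched in Python as follows. This is a hedged illustration rather than the authors' code: the function name is hypothetical, the rough rescaling step is assumed to have been done already, and the uniform interval is an assumption, since (16) is not legible in this copy. The interval used here, $[1/m - \sqrt{3(m-1)}/m,\ 1/m + \sqrt{3(m-1)}/m]$, is chosen so that each weight has mean 1/m and variance $(m-1)/m^2$.

```python
import math
import random


def simdat_draw(data, m, rng):
    """Generate one pseudo-data vector from a list of equal-length tuples."""
    center = data[rng.randrange(len(data))]
    # The selected point (distance 0 to itself) and its m - 1 nearest neighbours.
    neighbours = sorted(data, key=lambda p: math.dist(p, center))[:m]
    k = len(center)
    mean = [sum(p[d] for p in neighbours) / m for d in range(k)]
    # Assumed uniform interval for the weights (mean 1/m, variance (m-1)/m**2).
    half_width = math.sqrt(3.0 * (m - 1)) / m
    weights = [rng.uniform(1.0 / m - half_width, 1.0 / m + half_width)
               for _ in range(m)]
    # Weighted combination of the centred vectors, then translation by the mean.
    return [mean[d] + sum(u * (p[d] - mean[d]) for u, p in zip(weights, neighbours))
            for d in range(k)]


rng = random.Random(0)
data = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.5, 0.5)]
print(simdat_draw(data, 3, rng))
```

With m = 1 the interval collapses to the single weight 1 and the function returns the selected data point itself, which is the resampling-from-the-data case.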

The  selection  of  m  is  not  particularly 
critical.  Naturally,  if  we  let  m  =  1,  we  are 
simply  sampling  from  the  data  set  itself  (this 
is Efron's "bootstrap"), and will experience
the  difficulties  present  when  one  attempts  to 
use  a  discrete  entity  to  approximate  a 
continuous  one.  When  we  use  too  large  a 
fraction  of  the  total  data  set,  we  will  tend  to 
obscure  fine  detail.  But  the  selection  of  m  is 
not  the  crucial  matter  that  it  is  in  the  area  of 
nonparametric  density  estimation. 
Experience  indicates  that  the  use  of  m 
values  in  the  5%  range  appears  to  work 
reasonably  well. 

In  the  present  application,  we  note  the 
symmetry  between  SIMDAT  and  SIMEST. 
SIMDAT  makes  no  model  assumptions 
beyond  continuity  of  the  density  function. 
SIMEST  "creates"  its  own  "data"  and  is 
completely  driven  by  the  model  parameter 
$\Theta$. Using a Pearsonian philosophy, we
should select a $\Theta$ value which causes the
data  clouds  generated  by  SIMDAT  and 


SIMEST  to  be  essentially  indistinguishable. 
The  means  for  carrying  out  this  task  in  a 
computer  efficient  fashion  is  a  matter  of 
investigation  for  us  at  present.  One 
technique  which  appears  attractive  is  to 
generate  many  SIMDAT  and  SIMEST  data 
sets  of  size  much  smaller  than  the  size  of  the 
actual  data  base  and  use  as  a  measure  of 
agreement  rankings  of  distances  from  the 
common  transformed  origin. 
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ACCELERATION  METHODS  FOR  MONTE  CARLO  INTEGRATION  IN  BAYESIAN  INFERENCE 


John  Geweke,  Duke  University 


Abstract 




Methods  for  the  acceleration  of  Monte  Carlo 
integration  with  n  replications  in  a  sample 
of  size  T  are  investigated.  A  general 
procedure  for  combining  antithetic  variation 
and  grid  methods  with  Monte  Carlo  methods  is 
proposed,  and  it  is  shown  that  the  numerical 
accuracy  of  these  hybrid  methods  can  be 
evaluated  routinely.  The  derivation  indicates 
the  characteristics  of  applications  in  which 
acceleration  is  likely  to  be  most  beneficial. 
This  is  confirmed  in  a  worked  example,  in 
which  these  acceleration  methods  reduce  the 
computation  time  required  to  achieve  a  given 
degree  of  numerical  accuracy  by  several  orders 
of  magnitude. 


Background 

In a statistical model the distribution of
a vector of random variables $y_T = (y_1, \ldots, y_T)'$
is assumed to be known up to a vector of
parameters $\theta = (\theta_1, \ldots, \theta_k)'$. The model may
be expressed by the probability density
function whose kernel is the likelihood
function $L(y_T|\theta)$, with the functional form of
L known and $\theta$ unknown. In Bayesian
inference the unknown vector of parameters $\theta$
is regarded as random, and its distribution
conditional on the observed vector $y_T$ is
derived. If $\pi(\theta)$ is the prior probability
density of the parameters then the conditional
distribution of $\theta$ is $p(\theta|y_T) \propto L(y_T|\theta)\pi(\theta)$;
$p(\theta|y_T)$ is known as the posterior
distribution of $\theta$. Virtually all Bayesian
inference problems are cast in the form of
determining the expected value of a function
of interest under the posterior:

$E[g(\theta)|y_T] = \int_\Theta g(\theta)p(\theta|y_T)\,d\theta$, $\Theta$ being the
parameter space.

Among  the  attractions  of  Bayesian  inference 
are  its  provision  of  a  logically  consistent 
approach  in  complex  situations  and  its 
incorporation  in  decision  theory  (Berger  and 
Wolpert, 1984). There have been very
substantial  problems  in  the  implementation  of 
methods  for  Bayesian  inference,  however: 
these  problems  have  been  approached  on  a  case- 
by-case  basis,  with  limited  development  of 
generic  methods;  analytical  results  have  been 
obtained  only  in  a  limited  set  of  cases;  and 
computations  have  been  slow,  relative  to 
classical  methods.  The  development  of 
analytical  generic  methods  is  precluded  by  the 
intractability of the integrals in $E[g(\theta)|y_T]$
that emerge in all but the very simplest problems.

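The generic problem requires determination
of

$$E[g(\theta)|y_T] = \int_\Theta g(\theta)\pi(\theta)L(y_T|\theta)\,d\theta \cdot \left[\int_\Theta \pi(\theta)L(y_T|\theta)\,d\theta\right]^{-1}.$$

In applying the model the functional form of
$L(y_T|\theta)$ may itself be in doubt. If there are
m such models indexed by j, with prior
probability $\pi_j$, then the posterior
probabilities of the models themselves are
$p_j$ proportional to $\pi_j \int_\Theta \pi(\theta)L_j(y_T|\theta)\,d\theta$. In
this case $E[g(\theta)|y_T]$, unconditional on model
choice, is the average of conditional
$E[g(\theta)|y_T]$, weighted by the $p_j$.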

Monte  Carlo  Integration  with  Importance 
Sampling 

In Monte Carlo integration with importance
sampling a sequence of independent and
identically distributed random vectors $\{\theta_i\}_{i=1}^{n}$
is drawn from an importance sampling density
$I(\cdot)$; typically n is on the order of 10^.
Heuristically, the importance sampling density
should mimic the posterior density and to this
end define the weight function $w(\theta) = p(\theta)/I(\theta)$.
The value of $\bar{g}_T \equiv E[g(\theta)|y_T]$ is
approximated, numerically, by

$$\bar{g}_{n,T} = \sum_{i=1}^{n} g(\theta_i)w(\theta_i) \Big/ \sum_{i=1}^{n} w(\theta_i).$$

Statisticians have
been aware of this approach for over twenty
years (Hammersley and Handscomb, 1964). Kloek
and van Dijk (1978) conjectured that
$n^{1/2}(\bar{g}_{n,T} - \bar{g}_T) \Rightarrow N(0, \sigma_T^2)$ and provided a
method of approximating $\sigma_T^2$ numerically. This
is an important result, for it allows routine
evaluation of the numerical accuracy of $\bar{g}_{n,T}$.
It has been further shown (Geweke, 1986) that
if $I(\theta) > 0\ \forall\, \theta \in \Theta$, then $\bar{g}_{n,T} \to \bar{g}_T$; if in
addition $E[w(\theta)] < \infty$ and $E[g^2(\theta)w(\theta)] < \infty$,
then $n^{1/2}(\bar{g}_{n,T} - \bar{g}_T) \Rightarrow N(0, \sigma_T^2)$, where
$\sigma_T^2 = E\{[g(\theta) - \bar{g}_T]^2 w(\theta)\}$; and if

$$\hat{\sigma}_{n,T}^2 = \sum_{i=1}^{n} [g(\theta_i) - \bar{g}_{n,T}]^2 w(\theta_i)^2 \Big/ \Big[\sum_{i=1}^{n} w(\theta_i)\Big]^2$$

then $n\hat{\sigma}_{n,T}^2 \to \sigma_T^2$. (All convergence is in n,
the number of Monte Carlo replications.) The
conditions for convergence guide the choice of
$I(\theta)$, which is an experimental design problem.
Over the past two years readily applicable
methods for analytical identification of
families of importance sampling densities that
satisfy the moment conditions have been
developed (Geweke, 1986). Algorithmic methods
for construction of importance sampling
densities within these families have been
devised and implemented (Geweke, 1988a), but
this work is in an early stage.
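
A minimal Python sketch of the weighted estimator and its numerical standard error follows. It is illustrative only: the function names, the standard normal posterior kernel, and the N(0, 2^2) importance density are assumptions chosen so that the answer is known in advance (the posterior mean of the square of the parameter is 1). Weights need only be known up to a constant, since the constant cancels in both ratios.

```python
import math
import random


def importance_sampling(g, log_kernel, draw, log_importance, n, rng):
    """Return the weighted estimate of E[g] and its numerical standard error."""
    gs, ws = [], []
    for _ in range(n):
        theta = draw(rng)
        ws.append(math.exp(log_kernel(theta) - log_importance(theta)))
        gs.append(g(theta))
    sum_w = sum(ws)
    estimate = sum(gi * wi for gi, wi in zip(gs, ws)) / sum_w
    # Variance estimate: sum of squared deviations times squared weights,
    # divided by the squared sum of weights; its square root is the
    # numerical standard error of the estimate.
    var_hat = sum((gi - estimate) ** 2 * wi ** 2 for gi, wi in zip(gs, ws)) / sum_w ** 2
    return estimate, math.sqrt(var_hat)


def log_kernel(theta):
    """Assumed posterior kernel: standard normal, up to a constant."""
    return -0.5 * theta * theta


def log_importance(theta):
    """Assumed importance density: normal with mean 0 and standard deviation 2."""
    return -theta * theta / 8.0 - math.log(2.0) - 0.5 * math.log(2.0 * math.pi)


rng = random.Random(0)
est, nse = importance_sampling(lambda t: t * t, log_kernel,
                               lambda r: r.gauss(0.0, 2.0), log_importance, 20000, rng)
print(est, nse)
```

The importance density here has heavier tails than the posterior, so the weight function is bounded and both moment conditions hold.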


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Antithetic  Acceleration 

There are well-known variants on Monte Carlo
which accelerate convergence, but until quite
recently none has been applied to Bayesian
inference. A simple generic method is antithetic
acceleration, which uses the technique of
antithetic variates introduced by Hammersley
and Morton (1956). A pair of identically
distributed but negatively correlated vectors,
$\theta_i^1$ and $\theta_i^2$, are drawn from $I(\cdot)$, and the
value of $\bar{g}_T$ is approximated numerically by

$$\bar{g}^*_{n,T} = \sum_{i=1}^{n} [g(\theta_i^1)w(\theta_i^1) + g(\theta_i^2)w(\theta_i^2)] \Big/ \sum_{i=1}^{n} [w(\theta_i^1) + w(\theta_i^2)],$$

and as before we can compute $\hat{\sigma}^{*2}_{n,T}$ such that
$n\hat{\sigma}^{*2}_{n,T} \to \sigma^{*2}_T$. No matter what the scheme for
inducing negative correlation between $\theta_i^1$ and
$\theta_i^2$, so long as these vectors are drawn from
$I(\theta)$ the numerical accuracy of the procedure
may be evaluated using this result. In a
leading simple class of cases it has been shown
(Geweke, 1988b) that $T\sigma^{*2}_T/\sigma^2_T$ converges almost
surely to a finite positive constant as $T \to \infty$,
and the expression for the limit indicates that
the constant is smaller the more nearly linear
is the function g. An immediate implication
of this result is that the required number of
Monte Carlo iterations to achieve given
numerical accuracy relative to the dispersion
of the posterior density decreases as sample
size increases. This raises the possibility
that asymptotic approximations, like those
developed by Tierney and Kadane (1986), do not
necessarily become preferred on practical
grounds as sample size increases.
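
The role of linearity can be seen in a small Python sketch. It is an illustration under assumptions, not the paper's implementation: the names are hypothetical, the posterior kernel is standard normal, the importance density is N(0, 2^2), and the antithetic partner is obtained by reflecting each draw about zero, which is one simple way of inducing negative correlation when the importance density is symmetric.

```python
import math
import random


def antithetic_estimate(g, weight, draw, reflect, n, rng):
    """Weighted estimate of E[g] using antithetic pairs of draws."""
    num = 0.0
    den = 0.0
    for _ in range(n):
        t1 = draw(rng)
        t2 = reflect(t1)  # antithetic partner of the first draw
        w1, w2 = weight(t1), weight(t2)
        num += g(t1) * w1 + g(t2) * w2
        den += w1 + w2
    return num / den


def weight(theta):
    """Posterior kernel over importance density, up to a constant (assumed)."""
    return math.exp(-0.5 * theta * theta + theta * theta / 8.0)


rng = random.Random(0)
draw = lambda r: r.gauss(0.0, 2.0)
reflect = lambda t: -t
# Linear function of interest: each pair cancels, so the mean 0 is recovered exactly.
print(antithetic_estimate(lambda t: t, weight, draw, reflect, 100, rng))
# Nonlinear function of interest: the pairing gives no such cancellation.
print(antithetic_estimate(lambda t: t * t, weight, draw, reflect, 10000, rng))
```

For the linear function the contributions of the two members of each pair cancel, so the estimate equals the posterior mean for any number of replications; for the square the two members contribute identical terms and nothing is gained.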


Grid  Acceleration 

Antithetic acceleration is but one example
of an entire class of extensions of Monte
Carlo. In general, an m-tuple $\theta_i^1, \ldots, \theta_i^m$
may be drawn on the i'th Monte Carlo
replication, each $\theta_i^j$ having probability
density $I(\theta)$, $\theta_i^j$ independent of $\theta_k^p$ if
$i \neq k$, but $\theta_i^j$ and $\theta_i^p$ are in general
dependent. The value of $\bar{g}_T = E[g(\theta)|y_T]$ is
approximated, numerically, by

$$\bar{g}_{n,m,T} = \sum_{i=1}^{n}\sum_{j=1}^{m} g(\theta_i^j)w(\theta_i^j) \Big/ \sum_{i=1}^{n}\sum_{j=1}^{m} w(\theta_i^j).$$

(Antithetic acceleration is the special case in which m = 2
with $\theta_i^1$ and $\theta_i^2$ negatively correlated.)
It is not hard to show that for fixed m,
$n^{1/2}[\bar{g}_{n,m,T} - \bar{g}_T] \Rightarrow N(0, \sigma^2_{m,T})$;


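moreover, an estimator $\hat{\sigma}^2_{n,m,T}$ can be formed from the
deviations of the per-replication weighted averages
$\sum_{j=1}^{m} w(\theta_i^j)g(\theta_i^j) \big/ \sum_{j=1}^{m} w(\theta_i^j)$ from $\bar{g}_{n,m,T}$
such that $n\hat{\sigma}^2_{n,m,T} \to \sigma^2_{m,T}$. Hence any acceleration
method of this form can be routinely employed.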

The design of nonrandom sampling schemes is
a question of very great practical importance.
Grid or quadrature methods provide one basis
for these schemes. To illustrate a simple
grid method, let $u^* = (u_1^*, \ldots, u_k^*)'$ be
chosen at random from the unit hypercube in
$R^k$ and define an $\ell$-grid in $R^k$ by all
points of the form $u = (u_1, \ldots, u_k)'$,
$u_j = u_j^* + i/\ell$ ($i = 0, \ldots, \ell - 1$) modulo 1. Map
this grid into points in $\Theta$ via the
inverse c.d.f. of the importance sampling
density $I(\theta)$. This provides a feasible
method for low-dimensional problems. For
higher-dimensional problems, this particular
method is impractical because of the rapid
increase in $\ell^k$. In these problems the grid
mesh need not be the same for all parameters;
grids may be used for some parameters but not
others; or, antithetic variates may be used
for some parameters and grids for others.

To provide the intuition for the acceleration
inherent in grid methods, consider the
motivating problem of computing $\int_0^1 x\,dx = 1/2$.
Let $x_1 \sim U(0, m^{-1})$ and $x_j = x_1 + (j - 1)/m$,
j = 2, ..., m, and denote $\bar{x}_m = m^{-1}\sum_{j=1}^{m} x_j$. If
the integral is approximated by
$\bar{X}_{n,m} = n^{-1}\sum_{i=1}^{n} \bar{x}_{m,i}$, then
$\mathrm{var}(\bar{X}_{n,m}) = 1/(12m^2n)$. To
achieve $\mathrm{var}(\bar{X}_{n,m}) = v$, $n = 1/(12m^2v)$
Monte Carlo iterations are required. With
computation time proportional to mn,
computation time required to achieve
$\mathrm{var}(\bar{X}_{n,m}) = v$ is proportional to $1/(12mv)$.
This suggests that required computation time
with grid acceleration is approximately
inversely proportional to the number of
points, that is, approximately proportional to
the mesh of the grid. For the generic case in
Monte Carlo integration this conclusion seems
a reasonable conjecture, because of the local
linearity of inverse c.d.f.'s and functions of
interest. For mixtures of grids for some
parameters and antithetic variates for others
the situation is less clear. Yet another
complication is the choice of n; since
variance is proportional to $n^{-1}$ and $m^{-2}$,
computational efficiency alone would suggest
n = 1. However, n > 1 is required in order to
provide $\hat{\sigma}^2_{n,m,T}$, the assessment of the
numerical accuracy of the whole procedure. To
explore these practical matters we turn to a
worked example.
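
The variance claim for the motivating problem is easy to check by simulation. The Python sketch below is an illustration with hypothetical function names: it forms the grid estimate of the integral of x over the unit interval from n replications of an m-point grid with a random offset, and compares the empirical variance over many repetitions with $1/(12m^2n)$.

```python
import random
import statistics


def grid_estimate(m, n, rng):
    """Average of n grid means, each grid having m equally spaced points."""
    total = 0.0
    for _ in range(n):
        x1 = rng.uniform(0.0, 1.0 / m)  # random offset of the grid
        total += sum(x1 + j / m for j in range(m)) / m
    return total / n


def empirical_variance(m, n, reps, seed=0):
    """Variance of the grid estimate over independent repetitions."""
    rng = random.Random(seed)
    return statistics.pvariance([grid_estimate(m, n, rng) for _ in range(reps)])


m, n = 4, 10
print(empirical_variance(m, n, 20000), 1.0 / (12 * m * m * n))
```

Doubling m at fixed n cuts the variance by a factor of four while only doubling the number of function evaluations, which is the source of the acceleration described above.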
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A  Worked  Example 

We apply these methods to one of the most widely used econometric models, the linear model with first-order autocorrelation:

$$y_t = x_t'\beta + \epsilon_t; \qquad \epsilon_1 \sim N(0, \sigma^2 (1-\rho^2)^{-1});$$

$$\epsilon_t = \rho\,\epsilon_{t-1} + \eta_t; \qquad \eta_t \sim IIDN(0, \sigma^2).$$

The dependent variable is $y_t$; $x_t$ is a $k \times 1$ vector of independent variables; $\epsilon_t$ is an unobserved disturbance; $\beta$ is a vector of unknown parameters; $\sigma^2$ is an unknown positive variance parameter; and the autocorrelation parameter $\rho$ is less than one in absolute value. Letting

$$y_1(\rho) = (1-\rho^2)^{1/2} y_1; \qquad x_{1j}(\rho) = (1-\rho^2)^{1/2} x_{1j},$$

$$y_t(\rho) = y_t - \rho\,y_{t-1}; \qquad x_{tj}(\rho) = x_{tj} - \rho\,x_{t-1,j} \quad (t = 2, \ldots, T),$$

$$x_t(\rho) = [x_{t1}(\rho), \ldots, x_{tk}(\rho)]',$$

by a standard transformation (e.g., Theil (1971, pp. 250-233)) the log-likelihood function is

$$-T \log(\sigma) + (1/2) \log(1-\rho^2) - (1/2\sigma^2) \sum_{t=1}^{T} [y_t(\rho) - x_t(\rho)'\beta]^2.$$

A standard conjugate prior is $\pi(\beta, \sigma, \rho) \propto \sigma^{-1}$. Conditional on $\rho$, the posterior in $\sigma$ is inverse-gamma, and conditional on $\rho$ and $\sigma$ the posterior in $\beta$ is multivariate normal. With this in mind, denote the posterior distribution of $\beta$ and $\sigma$ conditional on $\rho$ by $\psi(\sigma|\rho)\,\phi(\beta|\sigma, \rho)$. It is straightforward to sample from $\psi(\cdot)$ and $\phi(\cdot)$. Given an importance sampling density $I^*(\rho)$ for $\rho$, we may choose the importance sampling density for all parameters to be

$$I(\theta) = I(\beta, \sigma, \rho) = I^*(\rho)\,\psi(\sigma|\rho)\,\phi(\beta|\sigma, \rho).$$

Since the range of $\rho$ is limited, if $I^*(\rho) > 0 \ \forall \rho \in [-1, 1]$ then $E[w(\theta)] < \infty$, and for most functions of interest it will be readily verified that $E[g^2(\theta) w(\theta)] < \infty$. A normal importance sampling density is therefore a reasonable candidate for $I^*(\rho)$.

The posterior mode is found by a global Hildreth-Lu (1960) search in $\rho$, followed by local maximization with a convergence criterion of 10” in $\rho$. For local values of $\rho$, the posterior was maximized in $\beta$ and $\sigma$, and the log posterior was compared with its value at the global maximum. For each such comparison, a normal density with mean at the global mode may be fit to the two points; the standard deviation of this density is greater, the smaller the difference in the log posterior at the two points. The standard deviation of the importance sampling density is taken to be the largest such standard deviation, over the range for which the difference in log posteriors is less than 20. (Choosing the largest standard deviation is likely to reduce $E[w(\theta)]$, as discussed in Geweke (1986).)

This procedure determines the importance sampling density $I^*(\rho)$ for $\rho$. Let $J^*(\rho)$ be the corresponding c.d.f., and $J^{*-1}(\rho)$ the inverse c.d.f. Given a preset number of gridpoints $m$, draw $u_{i1} \sim U(0, m^{-1})$ and set $u_{ij} = u_{i1} + (j-1)/m$ $(j = 2, \ldots, m)$; then $\rho_{ij} = J^{*-1}(u_{ij})$. We confine the grid to the single parameter $\rho$. Conditional on $\rho$, $\sigma$ and $\beta$ are independent, and conditional on $\rho$ and $\sigma$ the distribution of $\beta$ is symmetric. Thus, $\beta$ appears well suited for antithetic acceleration; if a function of interest is an element of $\beta$, then antithetic acceleration provides the exact mean of that element conditional on $\rho$, and the numerical approximation is exact up to the approximation of the posterior distribution of $\rho$.

We employ two examples, modelled by

$$y_t = \beta_0 + \beta_1 x_{t1} + \beta_2 x_{t2} + \epsilon_t,$$

where $y_t$ is the log of consumption of a consumer good, $x_{t1}$ is the log of income, and $x_{t2}$ is the log of the price of the consumer good relative to a general price index. In the first example, $y_t$ is log consumption of spirits, and the data are the 69 annual observations used in the example of Durbin and Watson (1951). The least squares coefficients are $b_0 = 4.607$, $b_1 = -.120$, $b_2 = -1.228$; the posterior mode is $\beta_0 = 2.453$, $\beta_1 = .622$, $\beta_2 = -.929$, $\rho = .993$. In the second example, $y_t$ is log consumption of textiles, and the data are the 17 annual observations provided by Theil and Nagar (1961). The least squares coefficients are $b_0 = 1.374$, $b_1 = 1.143$, $b_2 = -.829$; the posterior mode is $\beta_0 = 1.359$, $\beta_1 = 1.149$, $\beta_2 = -.827$, $\rho = -.125$.
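The grid construction for the autocorrelation parameter in this section (one uniform draw on $(0, 1/m)$, shifted by multiples of $1/m$ and mapped through the inverse c.d.f. of the normal importance sampling density) can be sketched as follows. This is an illustration under stated assumptions, not the author's code: `grid_draw` is a hypothetical name, the importance density is centred at the textiles posterior mode $-.125$, and its standard deviation of 0.3 is invented for the example.

```python
import random
from statistics import NormalDist


def grid_draw(m, importance, rng):
    """One Monte Carlo iteration of the grid: m points
    rho_j = J^{-1}(u_j) with u_j = u_1 + (j - 1)/m and u_1 ~ U(0, 1/m)."""
    while True:
        u1 = rng.random() / m
        us = [u1 + j / m for j in range(m)]
        # inv_cdf is defined only for 0 < u < 1; redraw in the rare edge case
        if all(0.0 < u < 1.0 for u in us):
            return [importance.inv_cdf(u) for u in us]


if __name__ == "__main__":
    rng = random.Random(1)
    # Mean at the textiles mode; sigma = 0.3 is a hypothetical value.
    importance = NormalDist(mu=-0.125, sigma=0.3)
    rhos = grid_draw(25, importance, rng)
    print([round(r, 3) for r in rhos])
```

Each iteration places exactly one point in every slice of importance-density probability $1/m$, so the $m$ points cover the range of the parameter evenly in probability rather than evenly in value.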

To explore the question of computational efficiency as a function of the number of grid points, six functions of interest were selected: the posterior means of $\beta_1$, $\beta_2$, and $\rho$; the predictive mean of a one-period-ahead value of $y_t$, taking one regressor at its sample mean and the other at its standard deviation; and $P[|\rho| < .1]$ and $P[\rho > 0]$, each computed under the posterior. In the spirits example $P[|\rho| < .1] = 0$ and $P[\rho > 0] = 1$, and the latter two functions of interest are not reported. In each case, $n = 2{,}000$. In one set of experiments $\beta$ was sampled by simple Monte Carlo, and in the other antithetic acceleration for $\beta$ was employed. Software developed by the author reports the posterior mean, the posterior standard deviation, the numerical standard error, and computation time. Given these, the computation time required to bring the numerical standard error to one percent of the posterior standard deviation was computed for the first four functions of interest, and the computation time required to bring it to .005 was computed for the two probabilities.
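One plausible way to obtain such required computation times from a single run is to use the scaling stated earlier: numerical variance is proportional to $n^{-1}$, and computation time is proportional to $n$ for fixed $m$. The sketch below rests on that assumption only; the function name and the numbers in the example are hypothetical and are not taken from the paper.

```python
def time_to_target(observed_time, observed_nse, target_nse):
    """Computation time needed to reach target_nse, assuming the numerical
    standard error falls as n**-0.5 and time grows linearly in n."""
    if observed_nse <= 0 or target_nse <= 0:
        raise ValueError("standard errors must be positive")
    return observed_time * (observed_nse / target_nse) ** 2


if __name__ == "__main__":
    # Hypothetical run: 50 seconds gave a numerical standard error of 0.004
    # for a posterior with standard deviation 0.2, so the target is 0.002.
    posterior_sd = 0.2
    print(time_to_target(50.0, 0.004, 0.01 * posterior_sd))
```

Halving the numerical standard error quadruples the required time, which is why reductions in variance per iteration translate so directly into the times reported in the tables.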

Results are reported in Tables 1 and 2. Computation times are given in seconds, using a MicroVax II and double precision arithmetic. (Each evaluation of the posterior density for a different value of $\rho$ requires re-solution of the least squares normal equations.) Not surprisingly, without antithetic acceleration computation times for $E[\beta_1]$ and $E[\beta_2]$ are unaffected by $m$. Increased $m$ provides some reduction for the prediction, which involves $\rho$, in the textiles example but not the spirits example. With antithetic acceleration the general pattern of reduction in computing time is the same for all functions of interest. In practical terms, the reduction in time afforded by the combination of grid and antithetic acceleration is quite substantial. The textiles example requires about three minutes with no acceleration; with $m = 25$, ten seconds suffices to produce $n = 20$ and meet the chosen numerical accuracy criteria. The spirits example requires about fourteen minutes with no acceleration; with $m = 25$, twelve seconds suffices to produce $n = 20$ and meet the accuracy criteria.

As expected, increasing the value of $m$
beyond  25  further  reduces  computation  time. 
However,  the  need  to  have  some  minimum  number 
of  n,  the  number  of  Monte  Carlo  iterations, 
limits  these  gains  as  a  practical  matter.  If 
one  uses  a  rule  of  thumb  that  sets  a  minimum 
value  of  n  (here  we  use  n  -  20)  then 
nothing  is  gained  by  increasing  m  over  the 
value  needed  to  achieve  the  required  numerical 
accuracy  in  the  minimum  number  of  Monte  Carlo 
iterations.  Put  another  way,  standards  for 
numerical accuracy and the requirement of a
minimum  number  of  Monte  Carlo  replications  to 
assess  numerical  accuracy  reasonably  well 
imply  an  optimal  value  for  m.  Here,  that 
value  appears  to  be  about  25,  but  of  course 
this  result  is  specific  to  the  two  examples. 

Based on a very simple motivating example, we conjectured that required computation time would be approximately inversely proportional to $m$. Evidence on this conjecture is provided in Table 3, which reports the product of $m$ and computation time given in Table 2. For the textiles example this product is roughly constant across $m$. For the spirits example the product declines as $m$ increases. The explanation for this behavior lies in the appeal to a local linear approximation in applying the motivating example to the much more complicated problem worked here. In the spirits example, the normal density $I^*(\rho)$ is centered at $\rho = .993$, with a standard deviation of .053. Since the log posterior density declined from its mode to $-\infty$ over an interval of length .007, the grid is poor, indeed, for values of $\rho$ above the posterior mode, until $m$ is large. This difficulty could be obviated by suitable transformation of $\rho$; but note that the problem affects only the conjectured relationship of computation time to $m$. The validity of the numerical procedure itself and computation of the numerical standard error are not in question.
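The scaled times in Table 3 can be reproduced directly from Table 2. The short check below uses the $E(\rho)$ row of the textiles example as printed in the two tables; the variable and function names are illustrative, and the tolerance only allows for the three-decimal rounding of the published figures.

```python
M_VALUES = [1, 2, 5, 12, 25, 50, 100]
# E(rho), textiles example, with antithetic acceleration (Table 2), in seconds
TABLE2_E_RHO = [174.867, 91.394, 40.863, 18.355, 9.447, 4.161, 1.802]
# The same row as reported in Table 3 (time multiplied by m)
TABLE3_E_RHO = [174.867, 182.788, 204.317, 220.265, 236.187, 208.068, 180.196]


def scaled_times(ms, times):
    """Product of grid size and computation time, as in Table 3."""
    return [m * t for m, t in zip(ms, times)]


def spread(values):
    """Ratio of the largest to the smallest value."""
    return max(values) / min(values)


if __name__ == "__main__":
    products = scaled_times(M_VALUES, TABLE2_E_RHO)
    for m, p, reported in zip(M_VALUES, products, TABLE3_E_RHO):
        print(m, round(p, 3), reported)
    print("spread of raw times:   ", round(spread(TABLE2_E_RHO), 1))
    print("spread of scaled times:", round(spread(products), 2))
```

The raw times fall by a factor of nearly one hundred as $m$ goes from 1 to 100, while the scaled times stay within a factor of about 1.4, which is the sense in which the product is roughly constant for this row.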

Conclusion 

The results here suggest the possibility of very substantial gains in computational efficiency from acceleration methods. More investigation is clearly warranted. The foremost problem is that full grids are impractical in more than one dimension; with $l$ grid points in each of $k$ dimensions and a full mesh, computation time is proportional to $l^k$. Hence more sophisticated strategies for grid construction bear investigation. The worked example was tailored to a specific problem, and more generic software is required to learn about appropriate design of grid and antithetic sampling in various models. Among the issues to be investigated are the possibility of algorithmic choice of different grid meshes for different parameters, or axes, to increase efficiency; and the potential additional increase in efficiency afforded by suitable preliminary transformation of parameters. A general proof of the proportionality of computation time to grid mesh would be enlightening, although it is not necessary for the implementation of these methods.

Reduction  of  computing  time  from  over  ten 
minutes  to  under  ten  seconds  on  desktop 
machines,  as  reported  here,  underscores  the 
fact  that  innovations  in  algorithms  complement 
ever  faster  hardware.  These  complementarities 
will  increase  with  innovations  in  hardware 
architecture.  In  particular,  grid  methods  are 
well  adapted  to  vector  or  parallel  processors, 
because  once  the  random  numbers  for  each  Monte 
Carlo  iteration  are  chosen,  the  evaluation  of 
the  posterior  density  and  functions  of  interest 
at  different  grid  points  typically  involve 
precisely  the  same  computations  (but  with 
different  parameter  values).  Since  vector  and 
parallel  architectures  are  now  accessible,  this 
seems  an  opportune  time  to  pursue  the 
implementation  of  acceleration  methods  on  these 
machines in anticipation of their wider
availability  in  the  intermediate  future. 
Table 1

Computation Times, Without Antithetic Acceleration

Textiles example

Function of
Interest              m=1       m=2       m=5      m=12      m=25      m=50     m=100
E(β1)             202.231   181.796   174.000   173.215   175.402   165.480   169.534
E(β2)             202.437   184.914   171.478   160.684   164.977   162.552   156.317
E(ρ)              178.036    89.000    39.032    18.375     8.784    3.860*    1.680*
Prediction        193.592   146.717   123.079   116.280   115.735   117.563   111.185
P(|ρ| < .1)       181.952    66.712    21.692     8.994    4.236*    2.896*    0.523*
P(ρ > 0)          156.567   107.887    18.792    14.066    6.717*    3.021*    2.290*
20 replications                                            *8.744   *17.458   *34.890

Spirits example

Function of
Interest              m=1       m=2       m=5      m=12      m=25      m=50     m=100
E(β1)             818.197   803.833   785.122   747.708   727.303   764.308   735.952
E(β2)             820.990   806.643   769.044   750.209   803.173   729.340   768.943
E(ρ)              443.002   395.034   144.442    25.712    6.163*    1.437*    0.405*
Prediction        911.174   887.308   807.634   755.261   724.942   808.616   787.963
20 replications                                           *11.164   *22.318   *44.599

Computation times are given in seconds for a MicroVax II using 64-bit arithmetic. Trailing asterisks denote times that imply fewer than 20 Monte Carlo iterations; computation time for 20 iterations is indicated by the figure with leading asterisk at the bottom of the column.


Table 2

Computation Times, With Antithetic Acceleration

Textiles example

Function of
Interest              m=1       m=2       m=5      m=12      m=25      m=50     m=100
E(β1)               1.606     0.959    0.409*    0.191*    0.088*    0.033*    0.012*
E(β2)              12.072    12.829     9.279     6.080    4.879*    3.406*    2.121*
E(ρ)              174.867    91.394    40.863    18.355     9.447    4.161*    1.802*
Prediction         60.502    27.948     9.935    4.063*    1.787*    0.694*    0.265*
P(|ρ| < .1)       173.620    70.353    22.259     9.649    4.226*    3.051*    0.616*
P(ρ > 0)          151.789   113.245    20.698    13.882    6.856*    3.279*    2.385*
20 replications                        *1.855    *4.415    *9.163   *18.295   *36.564

Spirits example

Function of
Interest              m=1       m=2       m=5      m=12      m=25      m=50     m=100
E(β1)              55.486    41.986    13.208    2.369*    0.592*    0.140*    0.039*
E(β2)              51.896     41.19    11.833    2.097*    0.515*    0.117*    0.034*
E(ρ)              475.000   427.113   143.177    25.872    6.550*    1.495*    0.424*
Prediction          5.045     4.138    1.598*    0.342*    0.092*    0.023*    0.006*
20 replications                        *2.348    *5.567   *11.570   *23.092   *46.18?

Computation times are given in seconds for a MicroVax II using 64-bit arithmetic. Trailing asterisks denote times that imply fewer than 20 Monte Carlo iterations; computation time for 20 iterations is indicated by the figure with leading asterisk at the bottom of the column.
Table 3

Computation Time Scaled by Grid Size, With Antithetic Acceleration

Textiles example

Function of
Interest              m=1       m=2       m=5      m=12      m=25      m=50     m=100
E(β1)               1.606     1.918     2.045     2.287     2.203     1.640     1.187
E(β2)              12.072    25.658    46.394    72.966   121.967   170.306   212.078
E(ρ)              174.867   182.788   204.317   220.265   236.187   208.068   180.196
Prediction         60.502    55.895    49.673    48.752    44.683    34.695    26.544
P(|ρ| < .1)       173.620   140.706   111.293   115.793   105.658   152.572    61.632
P(ρ > 0)          151.789   226.489   103.492   166.585   171.402   163.939   238.467

Spirits example

Function of
Interest              m=1       m=2       m=5      m=12      m=25      m=50     m=100
E(β1)              55.486    83.973    66.039    28.429    14.799     6.996     3.895
E(β2)              51.896    82.371    59.164    25.169    12.864     5.858     3.385
E(ρ)              475.000   854.227   715.884   310.467   163.748    74.763    42.437
Prediction          5.045     8.275     7.990     4.101     2.304     1.130     0.569

Figures given are the product of those in Table 2 with the corresponding value of m.
MIXTURE EXPERIMENTS AND FRACTIONAL FACTORIALS USED TO TAILOR COMPUTER SIMULATIONS


Turkan K. Gardenier, TKG Consultants, Ltd.


ABSTRACT 

Large scale computer simulations are in widespread and growing use in government, business and science. Within the Department of Defense the use of simulation is particularly crucial because the real-world scenario of the battle cannot be replicated. Environmental and health simulations for risk assessment have complex determinants of pollution and target sites.

A large number of parameters may initially appear to be needed in simulations, and experiment designs achieved through response surface methodology can reduce the final set of parameters to an efficient minimum.

This paper presents the use of several experiment design procedures, including fractional factorials, mixture experiments with constrained optimization, and Hadamard matrices, as pre-processors to computer simulations. These methods have been used by the author to (a) minimize the number of computer runs, (b) conduct an input-output analysis of model subroutines and measures of merit, (c) check for computational model validity, and (d) design interactive graphical evaluation schemes for the simulation developer and user. The use of experiment designs as pre-processors resulted in cost-savings as well as efficient interpretations for battle management.

INTRODUCTION 

As the number of parameters increases in computer simulations, the direct or indirect relationship between input and output becomes difficult to quantify. The necessary costs to run the simulation model increase in parallel.

To a statistician, experiment design as a discipline seems the most natural way to approach a screening effort for relevant variables. A simulation setting is the most natural context for collecting the relevant data, analyzing and interpreting them without having to defend "missing cells." In a previous report Gardenier (1982) illustrated the use of statistical principles in the study of complex relationships among simulation input variables. Since then, considerable emphasis has been placed on formulating surrogate models or metamodels ("models" of simulation models), which reduce the input-output relationships to the framework of a regression equation (Friedman et al., 1984; Kleijnen, 1982). Biles (1979) also stressed the importance of using statistical principles in designing simulation runs and interpreting the output of simulation experiments.

The objective of the present paper is to:

(a) demonstrate how the principles of experiment design, multivariate data analysis and optimization techniques can be applied to simulation models;

(b) show relative efficiencies among several experiment design plans.

Two ways of structuring statistical experiment designs as pre-processors, an integrated tool denoted as Pre-Prim by the author, will be demonstrated. The first deals with independent input vector parameters; the second deals with constrained mixture experiments.

PRE-PRIM  AS  A  NEW  CONCEPT 

Pre-Prim, as an integrated set of statistical tools, offers the capability of mechanizing the decisions of the simulation user. It offers the feasibility to:

(a) analyze a maximum number of input parameters with a minimum number of simulation runs;

(b) incorporate non-linearities and synergies of input variables into the mathematical regression model;

(c) assure stability and minimum variance in the coefficients of the metamodel or surrogate model.

Pre-Prim also offers a protocol for sequencing the simulation runs. Thus, if trials were to be interrupted at certain nodal points during the sequence of total experimentation, there would be a minimal impact on parameter efficiency.

Pre-Prim works interactively with the user in order to formulate:

(a) the minima/maxima of the input variables which, in essence, determine the region of the response surface explored;

(b) the nature of the function relating input to output, i.e., whether the relationship can be represented by a linear function, second order or higher order polynomial;

(c) whether 2, 3, or higher levels should be associated with the input variables, as decided upon in point (b) above;

(d) what mode and pattern of interactions among input variables need to be explored.

The total number of input parameters, consisting of main effects or linear terms, interactions, and non-linear terms, determines the design matrix. The design matrix itself determines the degrees of freedom available to estimate the error variance in the multivariate regression metamodel or surrogate model. For statistical efficiency purposes, it is essential to formulate these criteria prior to starting simulation runs which estimate model sensitivity. The pre-processor design matrix needs to maintain the criteria of balance and orthogonality.
PROTOTYPE DESIGN PLANS

Pre-processing design plans can be categorized by the number of "levels" in input variables and their particular mix. Pre-Prim has grouped them into the following categories:

(a) 2 or 3-level screening designs estimating main effects only;

(b) 2 or 3-level designs estimating interactions as well as main effects;

(c) mixed level designs combining 2 and 3-level input variables;

(d) designs involving a constrained sum of preference decisions, based upon mixture experiments.

The first three types assume that the input variables relate as independent vectors to the output. The last type of design solves for the optimal preference mix in inputs.

Full factorial designs estimate all main effects and all possible interactions up to order k-1 (k refers to the number of variables). The total number of necessary simulation runs in a full factorial design is represented by $l^k$, where $l$ corresponds to the number of levels.

Fractional factorial designs reduce the simulation run demands to a fraction of what would be needed for a full factorial, trading some estimable parameters for economy in computer run cost. Most main effects, and some of the most important 2 or 3-factor interactions, are estimated, the choice determined through user interface in Pre-Prim. An example of this type of design, as presented in BrouOe and Gardenier (1987), is shown in Table 1.

If no interactions, but only linear main effects are to be explored, it is possible to use the principles of Hadamard matrices and estimate the effects of up to k-1 input variables with as few as k simulation runs. These designs are denoted as "screening designs" because they allow for no interactions. They use only two levels for each factor, a feasible minimum and maximum. Thus they do not allow for estimation of non-linearities.
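A two-level screening design of the Hadamard type can be sketched in a few lines. The example below uses Sylvester's doubling construction, which is an assumption made for brevity: it yields orders that are powers of 2 only, and it is not claimed to be the construction Pre-Prim uses. Dropping the all-ones column of an order-k Hadamard matrix leaves k-1 balanced, mutually orthogonal factor columns for k runs.

```python
def sylvester_hadamard(order):
    """Hadamard matrix of the given order (a power of 2) as lists of +1/-1,
    built by Sylvester's doubling construction."""
    if order < 1 or order & (order - 1):
        raise ValueError("Sylvester's construction needs a power-of-2 order")
    h = [[1]]
    while len(h) < order:
        h = [row + row for row in h] + [row + [-x for x in row] for row in h]
    return h


def screening_design(runs):
    """Main-effects design with `runs` rows and runs - 1 two-level factors:
    the Hadamard matrix with its all-ones first column removed."""
    return [row[1:] for row in sylvester_hadamard(runs)]


def is_orthogonal(design):
    """True if every pair of factor columns has zero inner product."""
    cols = list(zip(*design))
    return all(
        sum(a * b for a, b in zip(cols[i], cols[j])) == 0
        for i in range(len(cols))
        for j in range(i + 1, len(cols))
    )


if __name__ == "__main__":
    design = screening_design(8)       # 7 factors in 8 simulation runs
    for run in design:
        print(" ".join("+" if x > 0 else "-" for x in run))
    print("orthogonal:", is_orthogonal(design))
```

Here "+" and "-" stand for the feasible maximum and minimum of each input; because every column is balanced and orthogonal to the others, each main effect is estimated independently of the rest.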

Pre-Prim includes several plans which allow for the estimation of non-linearities but which also result in cost-savings similar to those offered by fractional factorials and Hadamard matrices. Table 2 shows an illustration of the variable/level combinations in these plans. For example, in line 3 we see that a total of 9 factors can be screened with as few as 8 simulation runs; in this plan, 7 variables have 2 levels, one variable has 3 levels (thus allowing for non-linearity estimation) and one variable has 4 levels. 4-level variables may involve categorical data such as types of aircraft.

In the example above, if we had used a full factorial design estimating only linear effects for each variable, 512 simulation runs would have been required; Hadamard matrices could have reduced the sample size to 10. Both of the above plans would have estimated only linear effects; factorials would have given the full set of interactions, Hadamard matrices no interactions. The special Pre-Prim plans shown in Table 2 require only 8 simulation runs and enable non-linear effect estimation for 2 of the 9 variables.

In constrained sum designs solving for optimal preferences among input variables, the effect of 3 variables can be determined with 7 simulation runs, the effect of 4 input factors with 15 runs, and the effect of 5 variables with 21 runs. An example of this application is shown below.

AN APPLICATION TO MAN-IN-THE-LOOP BATTLE MANAGEMENT

Let us consider the case of a simulation model where various reentry vehicle (RV) and platform characteristics are being analyzed as to their effect at various phases of the battle: (a) boost, (b) post-boost, (c) midcourse and (d) terminal. After reviewing where each relevant subroutine impacts the output parameters (Kubeja, 1987), each input is related to nodes in the battle-phased output. Figure 1 shows these results. For example, platform characteristics affect all nodal phases; RVs, Target Type and Time of RV Impact affect boost, midcourse and terminal phases respectively. The Target Types impacted also had 5 options:

(1) missile silos;

(2) Cl sites;

(3) bomber bases;

(4) other military targets;

(5) urban industrial locations.

For a first stage analysis, three RV and two platform characteristics shown in Figure 1 were chosen. A second stage analysis was then formulated for the five options of Target Type, holding all other first stage variables constant. Our decision was reached after considering various options for pre-processor design. These, and the associated simulation run requirements, are shown in Table 3.

The two-stage constrained sum design selected represents less than 1/1000 of the simulation run requirements of option IA, only about 4% of option IB, 6% of option II, and 17% of option III. Savings in computer run time and related data interpretation are substantial.
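The quoted savings follow from the run counts listed for the alternative designs in Table 3. The sketch below simply recomputes the ratios from those counts as printed; the dictionary and function names are illustrative.

```python
# Simulation run requirements as printed in Table 3
RUNS = {
    "IA": 177_147,   # full factorial, 3-level variables
    "IB": 1_024,     # full factorial, 2-level variables
    "II": 672,       # full factorial and constrained sum
    "III": 252,      # Hadamard screening and constrained sum
    "IV": 42,        # two-stage constrained sum (the design selected)
}


def relative_cost(option, baseline):
    """Runs required by `option` as a fraction of those for `baseline`."""
    return RUNS[option] / RUNS[baseline]


if __name__ == "__main__":
    for baseline in ("IA", "IB", "II", "III"):
        share = relative_cost("IV", baseline)
        print(f"IV versus {baseline}: {100 * share:.2f}% of the runs")
```

The two-stage design needs 42 runs, which is about 4.1%, 6.3% and 16.7% of options IB, II and III respectively, and well under one thousandth of option IA.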

Results for the design selected can be analyzed in two ways. The first is the regression-oriented approach, where the input design matrix is submitted to multivariate regression and analysis of variance (ANOVA). In this approach (a) the coefficients obtained by matrix inversion are scanned for statistical significance, (b) regression is implemented again, keeping the significant variables, and (c) the coefficients are used as the terms of the metamodel. Hypothetical data have been analyzed by this procedure and are shown in Table 4. The results show the hypothetical output for the 5 input variables in the Stage I mixture design regressed against percentage total leakage during the battle. All input variables are statistically significant, with confidence coefficients, $1-\alpha$, ranging from .82 to .99.

These results are not as informative as the query of what mix among input variables optimizes the criterion output. For example, one question in battle management is to solve for the number of decoys which are optimal for a specific attack scenario; another is the differential benefits of the use of maneuvering versus electronic countermeasures (ECM). In the present application, the query was the optimal mix of preferences in the utility function scale ranging from 1 to 10. A prototype analysis of hypothetical results is shown in Table 5.

In this example, the sum of the weight preferences was set to 20. The optimal mix which would minimize the leakage of attacking RVs is shown in the first column under "Value at Minimum." The optimization module was successful in maintaining leakage during battle at less than .0001. The appropriateness of model fit was checked by a plot of residuals or deviations from the response surface.

Another tool for evaluating the response surface obtained from the analysis is triad plots (Scheffe, 1983). Figure 2 includes two hypothetical results for prototype diagrams. These can assist the simulation users in deciding among various alternatives. The letters in the triad plots refer to values of the output criterion variable, leakage, held at values .00 - .75 in intervals of .15. Individual codes are shown next to sub-figure I. The vertices in each triad correspond to one of the five alternative inputs. Three vertices are shown; two variables are held constant. At the corner of each vertex, maximal weight is apportioned to that variable.

Evaluating sub-figures I and II we note that it is more beneficial to choose the strategy of sub-figure II. In this hypothetical dataset, as we increase the relative weighting scheme to Platform Resources, leakage values approach zero; we see more C-coded values in contrast to the D- and E-coded values we noted in sub-figure I.

The essential difference between sub-figures I and II is that Target Type became a player in sub-figure II, replacing Number of RVs in sub-figure I.

Interactive graphics of this type, combined with pre-processors and surrogate modeling-related analytical techniques, can aid man-in-the-loop related strategy decisions.

CONCLUDING REMARKS

This paper has demonstrated how statistical experiment design principles, used as a Pre-Prim interface to large-scale simulations, can drastically reduce the simulation run costs. Regression oriented surrogate modeling or metamodeling can mathematically represent the relationship between input and output variables. Graphical techniques and optimization algorithms can solve for the best mix among strategies, thus helping the user in tradeoff decisions.
TABLE 1

PROTOTYPE  FRACTIONAL  FACTORIAL  DESIGN 
USING  SIX  INPUT  FACTORS  AND 
EIGHT  TWO-FACTOR  INTERACTIONS 
4

29 

• 

- 

4 

4 

4 

4 

4 

- 

- 

. 

. 

- 

. 

30 

4 

• 

4 

4 

4 

- 

- 

4 

4 

4 

- 

- 

- 

31 
4

4 

4 

4 

• 

. 

• 

- 

- 

4 

4 

4 

4 

32 

4 

4 

4 

4 

4 

4 

4 

4 

4 

4 

4 

4 

4 

Table 2

Nonlinear Preprocessor Run Requirements in Some Pre-Prim Designs

Simulation      Variables (V) / Levels (L)       Total
Runs            V  L      V  L      V  L         Variables
  9             4  3      4  2                     8
 18             7  3      7  2                    14
  8             1  4      1  3      7  2           9
 16             5  4      5  3     15  2          25
 32             9  4      9  3     31  2          49


Figure 1

DECISION WEIGHT CRITERIA

[Grid relating each criterion to the simulation phases. Columns (simulation phases): Boost, Post Boost, Mid Course, Terminal. Rows (criteria): RV characteristics (# RVs on Bus, Time before Booster Burnout, Target Type, Time RV Impact) and platform characteristics (Kill Probability, Platform Resources).]

Table 3

Possible Alternative Pre-Processor Designs for Man-in-the-Loop

Pre-Processor Alternative                       Number of Simulation Runs

I.   FULL FACTORIAL: 2 STAGES
     A. 3-level variables                       (3)^10 = 177,147
     B. 2-level variables                       (2)^10 = 1,024
II.  FULL FACTORIAL AND CONSTRAINED SUM         (2^5 x 21) = 672
III. HADAMARD SCREENING AND CONSTRAINED SUM     (21 x 12) = 252
IV.  TWO-STAGE CONSTRAINED SUM                  (2 x 21) = 42

Table 4

Regression Coefficients for LEAKAGEPERCENT

Coefficient   Term               Standard Error   T-Value   Confidence Coef <> 0

0.2998        RVNUMBER           0.1764           1.700     89.5%
0.5780        TARGETTYPE         0.1764           3.277     99.7%
0.7219        IMPACTTIME         0.1764           4.092     99.9%
0.9697        KILLPROB           0.1764           5.497     99.9%
0.2520        PLATFORMRESOUCE    0.1764           1.429     82.4%

Confidence figures are based on 16 degrees of freedom.

Analysis of Variance for LEAKAGEPERCENT

Source               df    SS        MS         F-Ratio

Total (corrected)    20    1.3001
Regression            4    0.4980    0.12451    2.484
Residual             16    0.8021    0.05013

(1) Implies 91.4% confidence regression equation is nonzero.
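
The T-values and the F-ratio in Table 4 follow from simple ratios, which the short Python sketch below recomputes from the tabulated figures. The dictionary and variable names are illustrative only; the numbers are transcribed from the table, and small differences in the last digit are expected because the tabulated sums of squares are rounded.

```python
# Figures transcribed from Table 4: term -> (coefficient, standard error).
coefficients = {
    "RVNUMBER": (0.2998, 0.1764),
    "TARGETTYPE": (0.5780, 0.1764),
    "IMPACTTIME": (0.7219, 0.1764),
    "KILLPROB": (0.9697, 0.1764),
    "PLATFORMRESOUCE": (0.2520, 0.1764),
}

# T-value = coefficient / standard error, rounded as in the table.
t_values = {name: round(coef / se, 3) for name, (coef, se) in coefficients.items()}

# Analysis of variance: mean square = sum of squares / degrees of freedom.
ss_regression, df_regression = 0.4980, 4
ss_residual, df_residual = 0.8021, 16
ms_regression = ss_regression / df_regression
ms_residual = ss_residual / df_residual

# F-ratio = regression mean square / residual mean square.
f_ratio = ms_regression / ms_residual

for name, t in t_values.items():
    print(f"{name:16s} t = {t:.3f}")
print(f"MS regression = {ms_regression:.5f}, MS residual = {ms_residual:.5f}")
print(f"F-ratio = {f_ratio:.3f}")
```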
Table 5

MINIMUM LEAKAGEPERCENT

A minimum of 0.000072 was achieved under the following conditions.

Value at                         Lower    Upper
Minimum     Factors              Limit    Limit

0.484       RVNUMBER             0.000    1.000
0.609       TARGETTYPE           0.000    1.000
0.466       IMPACTTIME           0.000    1.000
0.0468      KILLPROB             0.000    1.000
0.778       PLATFORMRESOUCE      0.000    1.000

LEAKAGEPERCENT.RES

 0.200 | L            1
 0.160 | L            1
 0.120 | L            1
 0.080 |
 0.040 | LLLLLLLL     8
 0.000 | LLLL         4
-0.040 | LLLLL        5
-0.080 | L            1
-0.120 |

Figure 2

Prototype Diagram of Hypothetical Data Used to Illustrate Use of Triad Contours

[Two triangular (triad) contour plots of LEAKAGEPERCENT.
Contour levels: A: 0.000, B: 0.150, C: 0.300, D: 0.450, E: 0.600, F: 0.750.

Panel 1: vertices RVNUMBER = 0.6000, IMPACTTIME = 0.6000, and
PLATFORMRESOUCE = 0.6000 (the other two vertex factors being 0.0000 at
each vertex), with TARGETTYPE = 0.2000 and KILLPROB = 0.2000 held fixed.

Panel 2: vertices TARGETTYPE = 0.6000, IMPACTTIME = 0.6000, and
PLATFORMRESOUCE = 0.6000 (the other two vertex factors being 0.0000 at
each vertex), with RVNUMBER = 0.2000 and KILLPROB = 0.2000 held fixed.]
SIMULATION AND STOCHASTIC MODELING FOR THE SPATIAL ALLOCATION OF MULTI-CATEGORICAL RESOURCES


Richard S. Segall, University of Lowell, Lowell, MA 01854


I. Background

This paper extends a mathematical model called
DRAM (Disaggregated Resource Allocation
Model) which was formulated by Venedictov et al.
(1977) and later refined by Gibbs (1978) and
Hughes (1980) at IIASA (International Institute
for Applied Systems Analysis) in Laxenburg,
Austria.

Even though the DRAM model was a product of
the Health Care Systems Modeling Task Force at
IIASA, its applicability to other types of
resources is unlimited. Basically, the DRAM
model is a simulation model which predicts how
a large-scale capacity system with constraints
on supply would respond when resource availability
changes.

Mayhew (1981a) further extended the DRAM
model to account for multi-specialty modeling
of patient flows over a geographical region with
a model called DRAMOS (Disaggregated Resource
Allocation Model Over Space). The DRAMOS model
is really a hybrid model between DRAM and a
model called RAMOS (Resource Allocation Model
Over Space) which was developed and successfully
tested by Mayhew (1980a, 1980b) to model single
category flows as an aggregate over geographical
regions in England and other countries with
capacity constraints.

Segall (1982, 1983, 1984, 1987a, c) and
Rising et al. (1984a) further extended the RAMOS
model and successfully applied it to actual data
for the State of Massachusetts as representative
of a large scale system which is not capacity
constrained, but rather is affected by market
forces.

This paper intends to refine the DRAMOS model
for application to market systems by assuming a
demand constraint on the origin of the consumer
instead of a supply constraint for the place of
economic consumption. This is analogous to the
distinction between the destination and origin
constrained forms of the RAMOS model as formulated
by Mayhew (1980a). Additionally, probabilistic
assumptions are made on certain parameters of the
model to make a stochastic version of it possible.


II. Mathematical Modeling


A.  An  Origin  Constrained  DRAMOS  Model 


Below is an origin constrained formulation
of the DRAMOS model which is an extension of
the Mayhew (1981a) destination constrained model.

Define:

$i$ = origin zone ($1 \le i \le m$)

$j$ = destination zone ($1 \le j \le n$)

$k$ = specialty category ($1 \le k \le p$)

$T_{ijk}$ = flow from origin $i$ to destination $j$ in category $k$

$\ell_{ik}$ = average length of stay for category $k$ from origin $i$

$W_{ik}$ = demand from origin $i$ of category $k$

$S_i$ = total capacity demanded by origin $i$

=  maximum  flow  from  origin  i  to  destination 
j  in  category  k 

$L_{ik}$ = maximum length of stay for category $k$
from origin $i$

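Model Objectives:

To evaluate the values of $T_{ijk}$ and of the average lengths of
stay $\ell_{ik}$ that satisfy the following equations (1) and (2)
subject to constraints (3) and (4):

$W_{ik} = \sum_j T_{ijk}$    (1)

$\sum_k W_{ik}\,\ell_{ik} = S_i$    (2)

$0 \le T_{ijk} \le$ maximum flow from origin $i$ to destination $j$ in category $k$    (3)

$0 \le \ell_{ik} \le L_{ik}$    (4)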

Below are deterministic and stochastic
versions of a non-linear preference function
originally developed by Hughes (1980) for the
consumer's zone i of residence. The optimization
problem is formulated with either of these
two versions of the objective function subject
to the constraints given by equations (3) and
(4) above.

The first version is used when the service
and demand benefit functions are known precisely.
This situation is modeled below by equations
(6) and (7), and requires knowledge of a very
large number of parameters, which is usually
not available. The second version overcomes
this difficulty by allowing both the preassigning
of parameter values and the assignment of
probabilities to parameter values. The latter
version is useful for planning of large scale
systems when parameter values are subject to
change over the planning horizon rather than
being held fixed.

Define:

$\alpha_{ik}$ = relative importance parameter of servicing maximum
flow of specialty $k$ from origin $i$ ($\alpha_{ik} > 0$)

$\beta_{ik}$ = relative importance parameter of having maximum
demand for specialty $k$ from origin $i$ ($\beta_{ik} > 0$)

$g(T_{ijk})$ = service benefit function

$h(\ell_{ik})$ = demand benefit function, with $\ell_{ik}$ the average length of stay

$C_i$ = marginal unit cost of demand in each origin zone $i$

VERSION 1: DETERMINISTIC NONLINEAR OBJECTIVE FUNCTION

$U_i(T, \ell) = \sum_k \sum_j g(T_{ijk}) + \sum_k \sum_j T_{ijk}\, h(\ell_{ik})$    (5)

where
Analogous supply driven expressions for the
above benefit functions of equations (6) and
(7) can be found in Gibbs (1978, p. 8-9).


VERSION  2:  STOCHASTIC  NONLINEAR  OBJECTIVE 
FUNCTION 

It should be recognized that realistic modeling
requires integer values for the variable $T_{ijk}$,
which counts the number of consumers
migrating from $i$ to $j$ for commodity or service
type $k$. That is, $T_{ijk}$ is a discrete variable
as given by

$T_{ijk} = 0, 1, 2, \ldots, n$    (8)

We  can  assign  probabilities  for  the  values 
of  Tijj,  as  being  equal  to  each  of  these 
integer  values,  by  the  following 

=  ^ . "] 

n  (9) 

n  (10(a)) 

=  2  P.  ... 

1=0 


where  0  <  and  ^^21  =  1.0. 

(10(b)) 

Similarly  we  can  extend  a  probabilistic 
interpretation  for  the  nonlinear  objective 
function  by  taking  probabilities  of  both  sides 
of  equation  (5)  : 

p[ui (T,  pj  j  =  gf'^ijk’] 

=  f y<Tijk>l  +  4'^ijk 

kj  '■  kj  ^  ■* 

Using  equation  (6) , 

P  [g'Tijk)]  =  P(^ijk=^)-  P(Ci  =  B).P(L.^=C)- 
-1 

P(“.,  =D)'  P(0,  ,,  ^k  =[r).  p(T.  ik  . 

ik  *^ijk  ijk 

where A, B, C, D, E, and F are prespecified values
for given i, j, and k.

Taking  summations  of  equation  (13)  yields: 


^  Because  usually  the  values  for  and 

ik  will  be  given  assumptions  of  the'^problem; 
their  associated  respective  probabilities  would 
be  1.0  and  hence  further  simplification  of 
equation  (14)  would  be  possible. 

Using  equation  (7), 

$P[T_{ijk} \cdot h(\ell_{ik})] = P(T_{ijk} = G) \cdot P(h(\ell_{ik}) = H)$    (15)

for prespecified values of G and H for given
i, j, and k.

Taking  summations  of  equation  (15)  yields: 

p[Tijk-h(^ik)]  =21  P^.^.  P„.^  ae, 

Combining  equations  (14)  and  (16)  yields  the 
general  form  for  the  stochastic  version  of  the 
nonlinear  objective  function. 


B.  Mathematical  Solution  to  Model 
1.  Overview 

This paper presents only the mathematical
solution to the model with the deterministic
nonlinear objective function. Solution
of the stochastic version would be quite
analogous. Below is a concise modified derivation
based upon the work of Gibbs (1978) and Mayhew
(1981a) for the case of constrained demand,
which is the actual scenario in the United
States, an application for which the original
models were not intended. The reader is referred
to Gibbs (1978) and Mayhew (1981a) for a
more detailed derivation of the solution to the
supply constrained model.

Using standard optimization techniques for
constrained functions, with Lagrangian multipliers
$\lambda_i$ for $1 \le i \le m$, we form the Lagrangian:


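$H_i(T, \ell, \lambda) = U_i(T, \ell) + \lambda_i \left( S_i - \sum_k W_{ik}\,\ell_{ik} \right)$    (17)

To maximize the nonlinear preference objective
function, it is necessary to solve the equations:

$\partial H_i / \partial T_{ijk} = 0$ for all $i, j, k$    (18)

$\partial H_i / \partial \ell_{ik} = 0$ for all $i$ and $k$    (19)

$\partial H_i / \partial \lambda_i = 0$ for all $i$    (20)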


Equations (17), (5) and (18) yield:


dg(T,  .,_) 

_ _ ijk- 

^  T.  ,. 

Ilk 


^fk  -  "^"ik’ 


(21) 


Equations (17), (5), and (19) yield:


dh  ( f  ) 

ik  -  X  .  W  =  0 

-  1  ik 


d  f 


ik 


(."’2) 
Substituting equation (7) into equation (22)
yields upon rearrangement:


ik 


L  X 

ik 


(23) 


Substituting  equations  (6)  and  (7)  into  equa¬ 
tion  (21)  yields: 

_ 1 


T-  1, 

igk 


4>.  .^(11.^) 

i]k  ik 


(a  +1) 

ik 


(24) 


whe  re 


ik 


1 


+  1) 


i 


-1 


(25) 


Equations (23), (24), and (25) provide the
solutions for the decision variables which maximize
the nonlinear objective function given by
equation (5) subject to the constraints given
by equations (2), (3), and (4).


2.  Parameter  Estimation  by  Log-Linear 
Regression 

The empirical elasticities for the
lengths of stay ($b^{\ell}_{ik}$) and for the number of
admissions to the facilities ($b^{W}_{ik}$) from each
origin $i$ of each category $k$ can be evaluated
using log-linear regression as described below.

Taking logarithms of equation (2) with respect
to the variables $\ell_{ik}$ and $W_{ik}$ respectively,
and extending all variables into the dimension of
time $t$, yields the following two equations:

$\log \ell_{ikt} = a^{\ell} + b^{\ell}_{ik} (\log S_{it}) + \varepsilon_{it}$    (26)

$\log W_{ikt} = a^{W} + b^{W}_{ik} (\log S_{it}) + Z_{it}$    (27)

In equations (26) and (27), $\varepsilon_{it}$ and $Z_{it}$ are
stochastic error terms; $a^{\ell}$ and $a^{W}$ are constants;
and $\ell_{ikt}$, $W_{ikt}$, and $S_{it}$ are actual observations
on average lengths of stay, generating factors,
and total capacity respectively, as consumed by
those originating from zone $i$ in specialty $k$
within a planning horizon of duration $t$. The
slope coefficients $b^{\ell}_{ik}$ and $b^{W}_{ik}$ of equations
(26) and (27) respectively are precisely the
empirical "elasticities" as defined previously.
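
As a rough illustration of the log-linear regression step, the Python sketch below fits a straight line to the logarithms of a response and of a capacity series, and reads the elasticity off as the slope. It assumes the regressor is total capacity; the function name `estimate_elasticity` and all data values are hypothetical and are not taken from the paper.

```python
import numpy as np


def estimate_elasticity(capacity, response):
    """Least squares fit of log(response) = a + b * log(capacity).

    Returns (b, a): the slope b is the empirical elasticity, a the constant.
    """
    x = np.log(np.asarray(capacity, dtype=float))
    y = np.log(np.asarray(response, dtype=float))
    slope, intercept = np.polyfit(x, y, 1)
    return slope, intercept


# Hypothetical observations over five periods for one origin and one category.
capacity = np.array([100.0, 120.0, 150.0, 180.0, 220.0])
# Noise-free response built with elasticity 0.3, so the fit should recover it.
length_of_stay = 2.0 * capacity ** 0.3

b_hat, a_hat = estimate_elasticity(capacity, length_of_stay)
print(f"estimated elasticity b = {b_hat:.4f}, constant a = {a_hat:.4f}")
```

With real data the error terms are not zero, so the recovered slope is only an estimate and its standard error would normally be reported alongside it.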

C. Algorithm for Parameter Estimation of Origin
Constrained Model: Both Deterministic and
Stochastic Versions

1. Estimate $T_{ijk}$ using the origin constrained
model as formulated by Mayhew (1980a).

2. Estimate empirical elasticities by
log-linear regression.

3. Determine which parameters can be estimated
or whether probabilistic assumptions need to be
applied, i.e. select the deterministic or the
stochastic version.

4. Using these parameter values, predict
$\ell_{ik}$ and $T_{ijk}$ by solving equations analogous to
equations (23) and (24) for either version. It
may be useful to perform sensitivity analysis
for predicting $\ell_{ik}$ and $T_{ijk}$ under varying
combinations of parameter values and capacity $S_i$.

III.  Some  Results  of  Simulation 

The goodness of the estimated parameters
is determined by performing simulations on
actual input data and comparing the predicted
$T_{ijk}$ flows with the actual flows. Several
standard statistical tests can be used to
determine the best set of parameter values. In
Table 1 below, results are presented using the
$R^2$ statistic, which as usual gives the proportion
of explained variance.

In Table 1, some results are presented for
simulation runs using the deterministic
DRAMOS model in its destination constrained
form with data representing hospital discharges
in multi-category specialties for the State of
Massachusetts in 1978. Three parameter
estimation techniques were used for model
calibration as shown in Table 1: slope=1.0
calibration, maximum likelihood, and maximum
$R^2$. These simulations show that the maximum
$R^2$ calibration method yielded the highest $R^2$
values and maximum likelihood generally
yielded the lowest. In Table 1, the parameter
p is the calibration coefficient value of the
multi-categorical extension of the model of Mayhew
(1980a) which yielded the corresponding $R^2$
value.
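
For readers who want to reproduce this kind of comparison, the Python sketch below computes one common definition of the $R^2$ statistic, one minus the ratio of the residual sum of squares to the total sum of squares, for predicted against actual flows. Whether the paper used exactly this definition is an assumption, and the flow values are invented for illustration.

```python
import numpy as np


def r_squared(actual, predicted):
    """Proportion of variance in `actual` explained by `predicted`."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    ss_residual = np.sum((actual - predicted) ** 2)
    ss_total = np.sum((actual - actual.mean()) ** 2)
    return 1.0 - ss_residual / ss_total


# Hypothetical actual and predicted flows for four origin-destination pairs.
actual_flows = np.array([120.0, 80.0, 40.0, 10.0])
predicted_flows = np.array([110.0, 90.0, 35.0, 15.0])

r2 = r_squared(actual_flows, predicted_flows)
print(f"R^2 = {r2:.4f}")
```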

Table 1: Calibration of DRAMOS using 1978 inpatient discharge data from
Massachusetts

                        Number of     SLOPE=1.0         MAXIMUM LIKELIHOOD    MAXIMUM R^2
Category of             patient       CALIBRATION       CALIBRATION
patient care            discharges    p        R^2      p        R^2          p        R^2

Total (all patients)    851760        .1600    .8407    .1175    .8577        .6100    .8803
Medical-Surgical        658942        .1600    .8395    .1151    .7998        .6100    .8759
Obstetric-Maternity      88192        .1900    .8920    .1408    .8588        .5100    .9218
Pediatric                84391        .1500    .7678    .1112    .7245        .4100    .8350
Psychiatric              20182        .1900    .8635    .1392    .8664        .8100    .9053

IV. Conclusions and Future Directions

This research extends a simulation model
for predicting multi-categorical flows within
large-scale market oriented systems. Both
deterministic and stochastic simulation versions
have been presented with some results
for the former version.

The future directions include more extensive
simulation runs for the deterministic
version. Also the mathematical solution to
the stochastic version of the model should be
completed in order to provide some results of
stochastic simulation.
A Monte Carlo Assessment of Cross-validation and the Cp Criterion
for Model Selection in Multiple Linear Regression


Robert  M.  Boudreau,  Virginia  Commonwealth  University 


1.  Introduction 

Consider  the  situation  of  fitting  a  multiple  linear  regression  to 
a  set  of  data.  The  data  consists  of  n  observations  on  some 
response  variable,  together  with  corresponding  observations  on  p 
predictor  variables.  The  ultimate  use  of  the  fitted  model  will  be 
as  a  prediction  equation.  The  current  data  is  to  be  used  to  assess 
and  select  the  “best”  subset  of  the  predictor  variables,  and  to 
provide  estimates  of  the  regression  coefficients  for  these  variables. 
“Best”  might  be  defined  in  terms  of  smallest  mean  squared  error 
of  prediction  (MSEP),  or  smallest  mean  absolute  deviation  of 
prediction  (MADP).  Keep  in  mind  that  the  “true”  model  is  not 
necessarily  the  “best”  model  for  prediction  purposes  (p  248 
Montgomery  and  Peck,  1982).  The  goal  here  is  different  than 
the  model  building  of  a  researcher/scientist  seeking  to  explain 
and  understand  the  relationships  between  the  predictors  and  the 
response  variable.  There  the  model  sought  should  contain  the 
correct  or  complete  set  of  regressors,  with  accurate  estimates  of 
the  coefficients.  The  distinction  is  useful  because  the  different 
criteria  used  for  model  selection  are  usually  motivated  by  and 
align  with  one  or  the  other  of  these  intended  uses  of  the  fitted 
model. 

2.  Unconditional  vs  Conditional  MSEP 

No  fitted  model  ever  contains  the  exact  values  of  the 
parameters,  since  these  are  estimated  from  the  data.  The 
parameter  estimates  are  unbiased  if  we  overfit,  and  biased  as 
estimates  of  the  true  values  of  the  parameters  if  we  underfit  (p 
247  Montgomery  and  Peck,  1982).  This  unbiasedness  or 
biasedness  refers  to  the  average  behavior  of  parameter  estimates 
averaged  over  many  researchers,  studies,  or  data  sets.  Similarly, 
a  prediction  equation  is  unbiased  or  biased  on  average  for  a 
response  to  be  predicted  depending  on  whether  we've  overfit  or 
underfit. 

In  terms  of  squared  error  of  prediction,  there  is  an  average,  or 
unconditional  mean  squared  error  of  prediction  (MSEP)  in  using 
a  particular  subset  of  variables.  Mallows  Cp  (Mallows,  1973) 

(p  252  Montgomery  and  Peck,  1982)  is  often  motivated  as  an 
estimate  of  MSEP  when  the  predictors  are  fixed. 

A natural, related question arises. Since the parameter
estimates are not exact, and are conditional on the training sets,
which estimated model best predicts new responses? The new
responses adhere to the “true” model, while our fitted model
differs conditional on the training data (see below).


This  seems  more  realistic  in  the  sense  that  it  will  be  our 
parameter  estimates  that  will  be  used.  One  would  like  to  pick 
the  fitted  model  with  the  smallest  conditional  mean  squared 
error  of  prediction  (CMSEP). 

The results of sections 3 and 4 show that the Cp criterion
(fixed predictors) and cross-validation (random predictors) are
uncorrelated with the CMSEPs for the fitted models. These
criteria therefore cannot be interpreted as estimates of the
CMSEPs for a particular data set. I point out here that these
results are for multiple linear regression with normal errors.
Cross-validation has wider application. It is an open question
whether cross-validation is uncorrelated with CMSEP for more
general regression functions, errors, and approximating prediction
functions (Efron, 1983).

3. Fixed Predictors: Cp

Let the current training data for assessing and selecting a
prediction equation satisfy the following:

$y = X\beta + \epsilon$    (1)

with dimensions $n \times 1$, $n \times p$, $p \times 1$, and $n \times 1$ respectively,
where $y$ is a vector of responses, $X$ is a fixed full rank matrix of
predictors, and the elements of $\epsilon$ are iid $N(0, \sigma^2)$. Prediction for
a submodel proceeds as follows. Select a subset of $k$ variables,
form $X_k$ by including only the columns of $X$ for these variables,
and fit a prediction equation by least squares:

$\hat{y}_k = X_k \hat{\beta}_k = P_k y$    (2)

where

$P_k = X_k (X_k' X_k)^{-1} X_k'$

Paralleling Efron (1986), consider predicting new studies to be
conducted with the same design matrix $X$ as the training data,
with responses $y_0$:

$y_0 = X\beta + \epsilon_0$    (3)

Then the CMSEP for the training set (1) in predicting new
responses is given by

$CMSEP_k = E_{y_0}\left[ \frac{1}{n} \|y_0 - P_k y\|^2 \right] = \frac{1}{n} \|X\beta - P_k y\|^2 + \sigma^2$    (4)

where $y$ is fixed. Then averaging over training data sets (1)
yields the average CMSEP

$MSEP_k = E[CMSEP_k] = \frac{1}{n} \|X\beta - P_k X\beta\|^2 + \left(1 + \frac{k}{n}\right)\sigma^2$    (5)
$MSEP_k$ is the mean squared prediction error for the subset of $k$
variables if all researchers use these variables.

As pointed out by Efron (1986), the statistic

$C_k = \frac{1}{n} \|y - P_k y\|^2 + \frac{2k}{n} \hat{\sigma}^2$    (6)

is equivalent to Mallows' Cp for the $k$ variables,
where $\hat{\sigma}^2 = \frac{1}{n-p} \|y - X\hat{\beta}\|^2$ (using all predictors)

$= \frac{1}{n-p} \|y - Py\|^2$

and $P = X(X'X)^{-1}X'$.

It is easy to show that

$E[C_k] = E[CMSEP_k] = MSEP_k$    (7)

Thus $C_k$ unbiasedly estimates $MSEP_k = E[CMSEP_k]$.

Asymptotically, Stone (1977) showed that Cp, AIC
(Akaike, 1973) and cross-validation (CV) (Stone, 1974) are
equivalent. Nishii (1984) showed that asymptotically Cp
and CV do not underfit, but they do overfit with non-zero
probabilities. Li (1987) showed that Cp and CV
asymptotically yield the best CMSEP.

For smaller $n$, are Cp and CMSEP related? First note that
for training data (1), $C_k$ of (6) expands as

$C_k = \frac{1}{n} y'(I - P_k) y + \frac{2k}{n(n-p)} y'(I - P) y$

$= \frac{1}{n} \beta'X'(I - P_k)X\beta + \frac{1}{n} \epsilon'(I - P_k)\epsilon + \frac{2}{n} \beta'X'(I - P_k)\epsilon + \frac{2k}{n(n-p)} \epsilon'(I - P)\epsilon$    (8)

The random part of $C_k$ involves a sum of two quadratic forms
and a linear form in $\epsilon$. Similarly, $CMSEP_k$ of (4) expands as

$CMSEP_k = \frac{1}{n} \|X\beta - P_k(X\beta + \epsilon)\|^2 + \sigma^2$

$= \frac{1}{n} \beta'X'(I - P_k)X\beta + \sigma^2 + \frac{1}{n} \epsilon'P_k\epsilon$    (9)

Observe that

$P_k(I - P_k) = P_k - P_k = 0$

$P_k(I - P) = P_k - P_k = 0$

Next, noting conditions for independence of quadratic and
linear forms in normal variables (Rao, 1973), we have that $C_k$
and $CMSEP_k$ are in fact uncorrelated (independent). $C_k$ must
exclusively be considered an estimate of $MSEP_k = E[CMSEP_k]$,
not of $CMSEP_k$. Mallows Cp recommends a set of variables to the
wider community. $C_k$ is an estimate of the $MSEP_k$. Whether
your $CMSEP_k$ for those variables is higher or lower than average
($MSEP_k$) you don't know.
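
The independence argument for the fixed-predictor case can be checked numerically. The Python sketch below is a small Monte Carlo experiment, not the author's SAS program: it holds a design matrix fixed, redraws the errors, and computes, for each draw, the statistic $C_k = \frac{1}{n}\|y - P_k y\|^2 + \frac{2k}{n}\hat{\sigma}^2$ and the conditional error $CMSEP_k = \frac{1}{n}\|X\beta - P_k y\|^2 + \sigma^2$. The sample size, coefficients, subset, seed, and number of iterations are all arbitrary choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

n, p, k, sigma = 30, 5, 2, 0.5
X = rng.standard_normal((n, p))              # design matrix, held fixed throughout
beta = np.array([1.0, 0.5, 0.25, 0.0, 0.0])  # illustrative true coefficients
Xk = X[:, :k]                                # subset: first k columns (an underfit)

# Projection matrices onto the column spaces of X and of the subset Xk.
P = X @ np.linalg.solve(X.T @ X, X.T)
Pk = Xk @ np.linalg.solve(Xk.T @ Xk, Xk.T)
mean_response = X @ beta

n_iter = 2000
c_k = np.empty(n_iter)
cmsep_k = np.empty(n_iter)
for r in range(n_iter):
    y = mean_response + sigma * rng.standard_normal(n)
    # Full-model variance estimate with n - p degrees of freedom.
    sigma2_hat = np.sum((y - P @ y) ** 2) / (n - p)
    c_k[r] = np.sum((y - Pk @ y) ** 2) / n + 2 * k * sigma2_hat / n
    cmsep_k[r] = np.sum((mean_response - Pk @ y) ** 2) / n + sigma ** 2

correlation = np.corrcoef(c_k, cmsep_k)[0, 1]
print(f"mean C_k     = {c_k.mean():.4f}")
print(f"mean CMSEP_k = {cmsep_k.mean():.4f}")
print(f"sample correlation = {correlation:.4f}")
```

Under these assumptions the two averages should agree closely, reflecting $E[C_k] = MSEP_k$, while the sample correlation should be near zero apart from Monte Carlo noise.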

4.  Random  Predictors:  Cross-validation. 

Next consider developing a linear least squares prediction
equation when the matrix X of predictors in model (1) is
random. This is usually the case in practice. When parametric
model (1) is true, then various criteria derived assuming this are
available and appropriate. Sp (reviewed by Hocking, 1976 and
Thompson, 1978) is an estimator of the unconditional mean
squared error averaged over multivariate normal predictors and
response. It is directly analogous to Cp for fixed predictors.


Also available are an AIC assuming multivariate normality
(Akaike, 1973) and an AIC without assuming a particular
covariance structure for X (comment by Akaike on Rao, 1987).

For a more general (possibly non-linear) unknown true
regression function of y on X, the performance of a
linear prediction equation cannot be assessed by the above
criteria. Various non-parametric estimators of the prediction
error have been proposed, including the jackknife, the bootstrap
(Efron, 1979 and Efron and Gong, 1983), and cross-validation
(Allen, 1971; Stone, 1974; Geisser, 1975). This paper
investigates a property of cross-validation as a method of
assessing a prediction equation when the multivariate normal
linear model holds.

The version of cross-validation considered here is the
leave-one-out at a time method. One observation and the
corresponding set of predictors is omitted. The remaining data,
consisting of $n-1$ observations on responses and predictor
variables, is used to fit a least squares linear prediction equation
(or any general prediction equation). The regression parameter
estimates thus obtained are used to predict the omitted response.
The squared difference is recorded. The process is repeated,
omitting each response/predictor pair temporarily until each
response has been predicted using the remaining $n-1$. The
process mimics the process of predicting new observations. The
apparent error rate

$\frac{1}{n} \|y - X_k \hat{\beta}_k\|^2$    (10)

which is closely related to the usual regression mean residual sum
of squares, is known to considerably underestimate the actual
prediction squared error in the random regressor case (Efron,
1986). This stems from two problems. The data is used to
predict itself, so tends to be optimistic. Further, because future
predictors are random, the training set of predictors, X, doesn't
represent the full variation of future observations unless n is very
large. This underestimation is one of the basic motivations for
bias correction such as the jackknife and the bootstrap.

Let $X_{k(i)}$ be the matrix composed of columns of $X$
corresponding to variables in some subset of $k$ of the total $p$
predictor variables, but with the $i$th row deleted. Similarly, let
$y_{(i)}$ be the vector of responses with the $i$th response omitted.

Then fitting

$y_{(i)} = X_{k(i)} \beta_k + \epsilon$

yields

$\hat{\beta}_{k(i)} = (X_{k(i)}' X_{k(i)})^{-1} X_{k(i)}' y_{(i)}$    (11)

The least squares predictor of the omitted response, $y_i$, using
the above estimated parameters from the remaining $n-1$
observations, is the product of the observations on the $k$
predictors for the $i$th response times the corresponding regression
coefficients (11). Denote the predictor by $\hat{y}_{i(i)}$.

The average of the squared prediction errors, called PRESS by
Allen (1973), and called CVAE by Rao (1987) abbreviating
Stone's (1974) cross-validatory assessment error, will be denoted
here as

$CV_k = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_{i(i)})^2$    (12)

As was pointed out by A.P. Dawid in his comment on Stone's
1974 paper on cross-validation, and also by Efron (1986), $CV_k$ is
an unbiased estimate of the unconditional mean squared error of
prediction when selecting a training set of random predictors of
size $n-1$ to develop a prediction equation, then using that
equation to predict new observations. The unconditional mean
squared error of prediction when predictors are random will be
denoted $MSEP_+$.

In the present context, the subset of the variables chosen is
the subset that minimizes $CV_k$ of (12). Having chosen a subset,
the full training set on all n observations is then used to estimate
the parameters to be used for prediction. Both Dawid and
Efron, in the papers just referred to, point out that CV is based
on a training set of size $n-1$. Consequently, CV underestimates,
or is biased downward, for the unconditional $MSEP_+$ for a given
subset because the final parameter estimates will be based on the
entire training set, which has size n. Dawid's exact expression in
the multivariate normal case, and Efron's (1986) simulations for
a linear fit to a quadratic (true but unknown) regression function
indicate that the bias is small, as might be expected.

The same question as in section 3 arises here concerning
conditional mean squared error of prediction ($CMSEP_+$). The
prediction equations use estimated parameters. Which
conditional prediction equation, conditional on the current
training set, yields the smallest $CMSEP_+$? Thompson (1978)
motivates Sp and PRESS (same as CV) as estimates of
$CMSEP_+$. Picard and Cook (1984), Efron (1983, 1986) and Rao
(1987) also view CV as an estimate of $CMSEP_+$. In what
follows evidence is given to suggest that a CV assessment
obtained from a given data set for a subset of variables is
uncorrelated with the $CMSEP_+$ for that data set. This runs
counter to intuition since CV actually simulates the process of
prediction by withholding independent observations to be used
for prediction. As in section 3, the results that follow are
restricted to multivariate normal predictors and response. The
behavior of CV in the general case is ultimately of interest
because that is the more appropriate situation in which to use
cross-validation. The multivariate normal case is a first step.

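$CV_k$ of (12) expands as (e.g. p 430, Montgomery and Peck,
1982)

$CV_k = \frac{1}{n} \sum_i (y_i - \hat{y}_{i(i)})^2$

$= \frac{1}{n} (X\beta + \epsilon)' Q_k (X\beta + \epsilon)$

$= \frac{1}{n} \left[ \beta'X'Q_kX\beta + 2\beta'X'Q_k\epsilon + \epsilon'Q_k\epsilon \right]$    (13)

where

$D_k = \mathrm{diag}(I - P_k)$

$P_k = X_k (X_k'X_k)^{-1} X_k'$

$Q_k = (I - P_k) D_k^{-2} (I - P_k)$

Like $C_k$, $CV_k$ is also a sum of a quadratic and a linear form in
$\epsilon$. The difference is that X is random, so that the coefficient
matrices of the quadratic form and the linear form are random.
They are fixed exactly like $C_k$, conditional on X.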
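
The quadratic-form expression for leave-one-out cross-validation rests on the standard identity that the deleted residual equals the ordinary residual divided by one minus the leverage. The Python sketch below checks this numerically by comparing a brute-force leave-one-out loop with $\frac{1}{n} y' Q y$, where $Q = (I - P) D^{-2} (I - P)$ and $D = \mathrm{diag}(I - P)$ for the projection matrix $P$ of the fitted subset. The simulated data, coefficients, and seed are arbitrary illustrations.

```python
import numpy as np

rng = np.random.default_rng(1)

n, k = 30, 3
# Subset design matrix: a constant column plus two simulated predictors.
Xk = np.column_stack([np.ones(n), rng.standard_normal((n, k - 1))])
y = Xk @ np.array([0.1, 1.0, 0.5]) + 0.5 * rng.standard_normal(n)

# Brute force: refit without observation i, then predict observation i.
squared_errors = np.empty(n)
for i in range(n):
    keep = np.arange(n) != i
    coef, *_ = np.linalg.lstsq(Xk[keep], y[keep], rcond=None)
    squared_errors[i] = (y[i] - Xk[i] @ coef) ** 2
cv_brute_force = squared_errors.mean()

# Closed form: CV = (1/n) y' Q y with Q = (I - P) D^{-2} (I - P).
identity = np.eye(n)
Pk = Xk @ np.linalg.solve(Xk.T @ Xk, Xk.T)
Dk = np.diag(np.diag(identity - Pk))
Qk = (identity - Pk) @ np.linalg.inv(Dk @ Dk) @ (identity - Pk)
cv_closed_form = y @ Qk @ y / n

print(f"brute force CV = {cv_brute_force:.6f}")
print(f"closed form CV = {cv_closed_form:.6f}")
```

The closed form needs only one fit, which is why the expansion is convenient both for computation and for the conditional argument in the text.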

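Let a training set of size n be given by

$y = X\beta + \epsilon$    (14)

with dimensions $n \times 1$, $n \times p$, $p \times 1$, and $n \times 1$ respectively.


Given a new observation

$y_0 = x_0'\beta + \epsilon_0$    (15)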

the  predicted  value  using  the  k  subset  is 

('6) 

Averaged  over  new  observations,  the  CMSEP-|.  for  the  k 
subset  based  on  the  training  data  (14)  is 

CMSEP^^  =  E,„,,„[(y„-y„^)^|X,.] 


—  C  +  M)  f  +  (10 

where  Mj  and  M2  depend  only  on  X. 

Conditional  on  training  set  X,  it  can  be  shown  that: 

Q^M,  =  0 

=  0 

Thus $CV_k$ (13) and $CMSEP_{k+}$ (17) are conditionally (locally)
uncorrelated for every set of predictors (14). In general,
conditionally uncorrelated does not imply unconditionally
uncorrelated. Based on simulations, however, there is strong
evidence that $CV_k$ and $CMSEP_{k+}$ are unconditionally
uncorrelated.

The following are the results of two of the many simulations
the author has run. The interpretation is the same in
every simulation that has been tried. The simulations were
performed in SAS PROC MATRIX, with 1000 Monte Carlo
iterations per experiment. In the first two experiments below,
training sets of sample size n = 30 with 5 multivariate normal
predictors with means equal to 0 (wlog), and a dependent
response, were generated. The last two variables are superfluous
by design.

Experiment  #1 


    y_i = .1 + X_1 + .5X_2 + .25X_3 + 0X_4 + 0X_5 + \epsilon_i      (i = 1, ..., 30)

where

                            1    .2    .2     0     0
                           .2     1    .2     0     0
    Cov(X_1, ..., X_5)  =  .2    .2     1     0     0
                            0     0     0     1     0
                            0     0     0     0     1

and \epsilon_i ~ N(0, 0.25).

Here X is multivariate normal N(\mu, \Sigma), and \epsilon is a vector
of iid N(0, \sigma^2) errors independent of X.

Fit a linear regression to a subset of k variables as in (2),
yielding coefficients \hat{\beta}_k (size k x 1). Form the p x 1 vector
\tilde{\beta}_k by putting the elements of \hat{\beta}_k in the appropriate
positions corresponding to the k variables in the subset, then
setting the remaining values to 0.

There are 2^6 - 1 possible subsets of the variables (constant
included). Presented here are the results for 6 of these as typical
cases.

In a hierarchical fashion:

    Fitting a constant yields CV_0

    Fitting a constant plus X_1 yields CV_1

    Fitting a constant plus X_1, X_2 yields CV_2

    ...

    Fitting a constant plus X_1, X_2, ..., X_5 yields CV_5.


Each of the 6 fitted models has a corresponding CMSEP_k.
After the 1000 iterations, the correlations between the CV's and
CMSEP_k's are given below.


PEARSON CORRELATION COEFFICIENTS / PROB > |R| UNDER H0: RHO=0 / N = 1000

[Table of correlations between CV0, ..., CV5 (columns) and CMSEP0, ..., CMSEP5 (rows), each with its p-value; the entries are illegible in the source.]


The striking feature is that the estimated correlations between
CV_k and CMSEP_k are all very small, with p-values all larger
than 0.1. The same is true for the remaining 57 possible subsets
(i.e. none having correlations significantly different from 0). The
null hypothesis that CV and CMSEP are uncorrelated is
accepted (statistically speaking).
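The design of Experiment #1 can be sketched in Python as follows. This is an illustration only, assuming NumPy rather than the SAS PROC MATRIX used by the author; the function names are hypothetical. It also assumes, for a new predictor vector x_0 ~ N(0, \Sigma) independent of the new error, that the conditional mean squared error of prediction has the closed form \sigma^2 + (b_0 - \hat{b}_0)^2 + (\beta - \tilde{\beta}_k)' \Sigma (\beta - \tilde{\beta}_k).

```python
import numpy as np

def experiment1_sigma():
    """Covariance of X_1..X_5 in Experiment #1."""
    S = np.eye(5)
    S[:3, :3] = 0.2
    np.fill_diagonal(S, 1.0)
    return S

def loo_cv(Xk, y):
    """Leave-one-out CV for least squares on design Xk (constant included)."""
    H = Xk @ np.linalg.solve(Xk.T @ Xk, Xk.T)
    resid = y - H @ y
    return float(np.mean((resid / (1.0 - np.diag(H))) ** 2))

def cmsep(b0_hat, b_tilde, b0, b, Sigma, sigma2):
    """Conditional MSEP, averaging over a new x0 ~ N(0, Sigma) and new error."""
    d = b - b_tilde
    return float(sigma2 + (b0 - b0_hat) ** 2 + d @ Sigma @ d)

def simulate(k, n=30, reps=1000, seed=0):
    """Correlation between CV_k and CMSEP_k over Monte Carlo training sets."""
    rng = np.random.default_rng(seed)
    b0, b = 0.1, np.array([1.0, 0.5, 0.25, 0.0, 0.0])
    Sigma, sigma2 = experiment1_sigma(), 0.25
    cv, cm = np.empty(reps), np.empty(reps)
    for r in range(reps):
        X = rng.multivariate_normal(np.zeros(5), Sigma, size=n)
        y = b0 + X @ b + rng.normal(scale=np.sqrt(sigma2), size=n)
        Xk = np.column_stack([np.ones(n), X[:, :k]])   # constant plus first k predictors
        coef = np.linalg.lstsq(Xk, y, rcond=None)[0]
        b_tilde = np.zeros(5)
        b_tilde[:k] = coef[1:]                         # zeros for excluded variables
        cv[r] = loo_cv(Xk, y)
        cm[r] = cmsep(coef[0], b_tilde, b0, b, Sigma, sigma2)
    return float(np.corrcoef(cv, cm)[0, 1])
```

Calling simulate(k) for k = 0, ..., 5 gives one estimated correlation per hierarchical subset; the values will depend on the random number generator and seed and need not reproduce the author's figures.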


Experiment #2


    y_i = .25 + .5X_1 + .1X_2 + X_3 + 0X_4 + 0X_5 + \epsilon_i      (i = 1, ..., 30)

where


Cov(X_1, ..., X_5) =


.1  .1 


.0 


*> 


.5  .2 

1  .2 

.2  I 

.1  .1 


1 

I 

1 

I 

1 


and \epsilon_i ~ N(0, 0.25).

PEARSON CORRELATION COEFFICIENTS / PROB > |R| UNDER H0: RHO=0 / N = 1000

[Table of correlations between CV0, ..., CV5 (columns) and CMSEP0, ..., CMSEP5 (rows), each with its p-value; the entries are too badly garbled in the source to be restored reliably.]


As in experiment #1, and in every case tried, the correlations
between CV and CMSEP are not significantly different from 0.

Experiment #3 (Efron, 1986)

    y_i = x_i + 0.01 x_i^2 + \epsilon_i      (i = 1, ..., 20)

    x_i ~ N(0, 10^2),  \epsilon_i ~ N(0, 1)


Fitting a simple linear regression to this quadratic (non-linear)
data


    \hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i

has a corresponding cross-validatory assessment CV. The
following, reproduced from Efron (1986), gives the first 10 of 20
Monte Carlo trials, plus a summary of all 20 trials. Err_+ means
the same as CMSEP+ in this paper. The fundamental difference
in experiment #3 is that Efron's true simulated regression is
quadratic while the model fit is a linear one. He also compares
the bootstrap estimate of CMSEP+.


    Trial       CV      Bootstrap (B = 400)      Err_+
      1        3.36           3.18                3.42
      2        3.54           ?.87                3.21
      3        4.73           4.48                2.79
      4        4.08           3.66                3.42
      5        2.84           2.67                3.11
      6        4.80           4.22                2.86
      7        2.31           2.17                3.46
      8        3.01           2.54                2.72
      9        3.67           3.16                4.81
     10        4.13           3.64                3.72

     AVE       3.26           2.97                3.28
     (SD)     (1.15)          (.88)               (.54)

Notice that the CV is nearly unbiased for average CMSEP+,
while the bootstrap is biased downward considerably more. On
the other hand, the bootstrap is closer to CMSEP+ in a mean
squared error sense than is CV (see Efron 1986 for details).
Efron gives evidence that the bootstrap is a "somewhat" better
estimate of CMSEP+ than is CV, although not strongly so
judging by his simulation.

The point made in this paper is that computing the
correlation between CV and CMSEP+ in Efron's simulation
yields r = -.11 and p-value = .72. There is again evidence that
CV and CMSEP+ are uncorrelated, this time when the true
regression is non-linear. This author maintains that it is
awkward to interpret one random variable (CV) as an estimate
of another random variable (CMSEP+) when the two are
uncorrelated. The mean squared error of their difference comes
solely from their respective variances and the squared difference
in their expected values. One doesn't track the other in any
discernible way if they're uncorrelated. C_p and CV must be seen
as estimates of the unconditional mean squared error of
prediction of a subset of variables.
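The statement about the mean squared error of the difference follows from the standard decomposition below, written here in LaTeX as a worked step (the notation CMSEP_{+} is the paper's; nothing else is assumed).

```latex
\begin{aligned}
E\big[(CV - CMSEP_{+})^2\big]
  &= \mathrm{Var}(CV) + \mathrm{Var}(CMSEP_{+})
     - 2\,\mathrm{Cov}(CV, CMSEP_{+})
     + \big(E[CV] - E[CMSEP_{+}]\big)^2 \\
  &= \mathrm{Var}(CV) + \mathrm{Var}(CMSEP_{+})
     + \big(E[CV] - E[CMSEP_{+}]\big)^2
  \qquad \text{when } \mathrm{Cov}(CV, CMSEP_{+}) = 0 .
\end{aligned}
```

With zero covariance, no part of the variability of CMSEP+ is explained by CV, so the error of CV as an "estimate" is at least the full variance of CMSEP+.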
IT'S TIME TO STOP

by

Hubert Lilliefors
The George Washington University

ABSTRACT: Simulations are frequently used to
estimate certain characteristics of a distribution. A
question that arises is how large a sample should
we use? We consider specifically the estimation of
population quantiles. The procedure presented here
relies on the large sample normality of sample
quantiles. This requires an estimate of the density
function evaluated at the quantile. An apparently
new estimator is used and is compared to the
Siddiqui estimator. Simulation results are used to
compare the estimators and also to compare several
stopping procedures.

1. INTRODUCTION: Simulations are frequently
used to estimate certain characteristics of a
distribution such as the mean or the median. In
the case explicitly considered in this paper the
estimated characteristics are the 90th, 95th and
99th quantiles, which are appropriate when
generating critical values for some test statistic.
The test statistic is generated independently a
large number of times and then the sample quantile
is used as an estimate of the population quantile.

The question is: how large is large? When can
the simulation be stopped? Should there be 500
repetitions? or 5000 repetitions? or ...
repetitions? One approach is to generate
confidence intervals sequentially until a prescribed
fixed width is obtained. This is discussed in Law
and Kelton (1982) for estimation of the mean of a
distribution (additional references are also given).
Recently Dallal and Wilkinson (1986) used this type
of procedure for quantile estimation. They started
with a sample size of 50000 and computed a 95%
confidence interval for the 99th quantile. If the
width of the interval was less than some
prescribed width (they used .001) they stopped,
otherwise they added another 50000 to the sample
and tried again. This continued until either their
condition was satisfied or they reached an upper
bound on the sample size.

In this paper an alternative (large sample)
procedure is presented for determining when to
stop a simulation when the purpose of the
simulation is to estimate a population quantile.
This procedure uses an (apparently new) estimator
for the value of a density function evaluated at a
particular quantile. The method compares favorably
to the Dallal & Wilkinson procedure in a rather
limited simulation. The same method might be used
when estimating other characteristics of a
distribution.

2. THE NEW PROCEDURE: The alternative procedure
makes use of the well known asymptotic (normal)
distribution of sample quantiles (see for example
David (1970)). If we require a 95% probability that
the sample quantile is within a distance K of the
population quantile, then the sample size required
is given by:

(2.1)    N = p(1-p)(1.96)^2 / [K f(x_p)]^2

where x_p is the p-th population quantile, and an
estimate is needed for the density function f
evaluated at the population quantile.
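As a small illustration of (2.1), the sample size can be computed as below. This is a sketch in Python using only the standard library; the function and argument names are illustrative, and rounding up to the next integer is an assumption made here.

```python
import math
from statistics import NormalDist

def required_sample_size(p, density_at_quantile, distance, prob=0.95):
    """Sample size from (2.1): N = p(1-p) z^2 / [K f(x_p)]^2, rounded up.

    z is the two-sided normal critical value for the requested probability
    (about 1.96 when prob = 0.95).
    """
    z = NormalDist().inv_cdf(0.5 + prob / 2.0)
    n = p * (1.0 - p) * z ** 2 / (distance * density_at_quantile) ** 2
    return math.ceil(n)
```

For example, for the .90 quantile of the unit exponential distribution (density .10 at the quantile) and a distance of .120, required_sample_size(0.90, 0.10, 0.120) gives a value of about 2400.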

Two density estimators were used, the Siddiqui
(1960) estimator and a new least squares estimator.
These are discussed in the next section.

The procedure has two stages. In the first stage
a preliminary sample is drawn to provide an
estimate of the density function evaluated at the
quantile of interest. Using this estimate and (2.1),
an estimate is obtained of the required total
sample size, and hence the size of the additional
sample needed. The second sample is drawn and
the sample quantile is determined from the
combined samples to provide an estimate of the
population quantile.

A (three stage) variation on this procedure was
also tried in which, after the second sample is
drawn, we again estimate the density function and
if a larger sample is determined to be necessary
we draw another sample.

3. DENSITY ESTIMATORS: In order to use (2.1) to
determine when to stop the simulation, an estimate
is required for the reciprocal of the density
function evaluated at the quantile of interest.
There has been a great deal of work on estimating
density functions. Silverman (1986) provides a
good general description of the basic techniques.
Any of these techniques might be used to obtain
an estimate of the density function and then
evaluate this estimated density function at the
estimate of the quantile.

A rather clever procedure which avoids using
two estimates was suggested by Siddiqui (1960),
further developed by Bloch and Gastwirth (1968)
and by Bofinger (1975). A second estimator, which
uses a least squares calculation (see also
Eldessouky (1985)), was suggested by the form of
the Siddiqui estimator. These are described below.

a) SIDDIQUI ESTIMATOR: We follow the
development in Bofinger (1975) with slight changes
in notation.

The Siddiqui estimator for 1/f(x_p) is

(3.1)    T(N) = (X_(m) - X_(n)) / (d_m + d_n)

where N is the sample size, X_(m) and X_(n) are order statistics, and

    m = [N(p + d_m)] + 1
    n = [N(p - d_n)] + 1

For the case d_m = d_n = d, the equation for T(N)
becomes

(3.2)    T(N) = (X_(m) - X_(n)) / (2d)

and Bofinger shows that asymptotically the
(in a mean squared error sense) optimal value for d is

(3.3)    d = C N^{-1/5}

where

(3.4)    C = { 4.5 g(p)^2 / [g''(p)]^2 }^{1/5}

with g(p) = 1/f(x_p) and g''(p) its second derivative with respect to p.
The table below shows the values for C
calculated from (3.4). These (optimal)
values for C do not change a great deal as we go
from the heavy tailed exponential distribution to
the light tailed Weibull (with shape parameter
4) distribution.


TABLE 1  OPTIMAL VALUE FOR C

                           QUANTILE
DISTRIBUTION        .50       .90       .95       .99
Exponential        .5880     .1623     .0932     .0257
Normal             .6447     .1876     .1043     .0277
Weibull (4)                  .2286     .1349     .0417

Intuitively it is fairly clear why the Siddiqui
estimator works. If for x in some region
X_(n) < x < X_(m) (where X_(n) and X_(m) are the
order statistics defined above) the cumulative
distribution function, F(x), is approximately linear,
then the slope of that line is the density function,
f(x), and if x_p (the p-th quantile) lies within that
interval f(x_p) will equal that slope. Thus we want
to take an interval narrow enough so that the
approximate linearity holds.

b) LEAST SQUARES ESTIMATORS: We note that
there will be many data points between X_(n) and
X_(m) but that for the Siddiqui estimator we simply
connect the two extreme points to get an
approximation to F(x) in that interval (and then
use the slope of that line as the estimate for the
density function f(x)).

Two possibilities occur immediately:

(i) The straight line approximation is good, but
why not use all the data? Using the same notation
as in (3.1) above, we use the end points and 18
(evenly spaced) additional points between X_(n) and
X_(m) for a total of 20 points and then use the
ordinary least squares procedure to fit a straight
line to the data. This seemed to work about as
well as the Siddiqui procedure, but was no
improvement. Using all the points between X_(n) and
X_(m) gave almost exactly the same results but
required considerably more time for the
computations. (A weighted least squares might have
given an improvement but was not tried.)

(ii) If a straight line approximation to F(x) works
well, then why not try a quadratic approximation
using the ordinary least squares fit to the 20
points as described above? This will give

    F(x) = b_0 + b_1 x + b_2 x^2
and from this
    f(x) = b_1 + 2 b_2 x

and to estimate f(x_p), the density function
evaluated at the p-th quantile, we use the p-th sample
quantile as an estimate of the p-th population
quantile.
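The quadratic least squares estimate in (ii) can be sketched in Python as follows, assuming NumPy. The function name is illustrative, and several details are assumptions made here because the paper does not spell them out: the 20 points are taken evenly spaced in rank, the ordinate for rank i is i/N, and the p-th sample quantile is taken to be the order statistic of rank ceil(Np).

```python
import numpy as np

def least_squares_density(sample, p, c, n_points=20):
    """Fit F(x) ~ b0 + b1*x + b2*x^2 on [X_(n), X_(m)] and return b1 + 2*b2*x_p."""
    x = np.sort(np.asarray(sample, dtype=float))
    N = x.size
    d = c * N ** (-0.2)
    hi = min(int(np.floor(N * (p + d))) + 1, N)    # rank m, clamped to N
    lo = max(int(np.floor(N * (p - d))) + 1, 1)    # rank n, clamped to 1
    ranks = np.unique(np.round(np.linspace(lo, hi, n_points)).astype(int))
    xs = x[ranks - 1]                              # order statistics at those ranks
    Fs = ranks / N                                 # empirical CDF values
    b2, b1, _b0 = np.polyfit(xs, Fs, 2)            # highest power first
    x_p = x[int(np.ceil(N * p)) - 1]               # p-th sample quantile
    return float(b1 + 2.0 * b2 * x_p)
```

Because the fitted curve can bend, the slope at the sample quantile is less affected by a wide interval than the chord used by the Siddiqui estimator.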

This seems to have two advantages over the
Siddiqui estimator:

(a) It is more accurate, based on a rather limited
simulation.

(b) It is less sensitive to the choice of d (or
equivalently to the choice of the endpoints of the
interval - but see the discussion below of the
simulation results with the exponential
distribution.)

For the interval of x values, we use an interval
of the same form as that for the Siddiqui
estimator (see (3.2) and (3.3)) and try different
values for C (see (3.3)).

4. SIMULATION TO DETERMINE THE VALUE FOR
C: A simulation was used to determine the
sensitivity of the Siddiqui estimator to the choice
of C and to determine which value of C to use
with the least squares estimator. The simulation
was performed for each of three distributions
ranging from light tail (Weibull) to heavy tail
(Exponential). In each case the sample size used in
making the estimate was 1000 and there were 3000
repetitions.

TABLE 2a  Density Function Estimate at .90
          Quantile for Weibull Distribution
          with shape parameter 4. The actual
          value is .748.

             SIDDIQUI ESTIMATOR*       LEAST SQUARES ESTIMATOR
   C         AVERAGE      MSE          AVERAGE      MSE
  .05         .774       .0217          .812       .0371
  .10         .760       .0111          .784       .0162
  .15         .73?       .0067          .7?2       .0098
  .20         .718       .0051          .767       .0073
  .25         .681       .0076          .761       .0077
  .30         .6?5       .0129          .762       .0047
  .35         .572                      .760       .0039
  .40         .989       .1308          .???       .0026

* Optimal C from (3.4) is .229

TABLE 2b  Density Function Estimate at .90
          Quantile for Normal Distribution.
          Actual value is .176.

             SIDDIQUI ESTIMATOR*       LEAST SQUARES ESTIMATOR
   C         AVERAGE      MSE          AVERAGE      MSE
  .05         .176       .00129         .191       .00209
  .10         .178       .00060         .181       .00094
  .15         .17?       .00038         .181       .00055
  .20         .168       .00032         .180       .00042
  .25         .178       .00048         .180       .00031
  .30         .149       .00086         .180       .00029
  .35         .131       .00213         .179       .00024
  .40         .087       .00829         .171       .00017

* Optimal C from (3.4) is .188

TABLE 2c  Density Function Estimate at .90
          Quantile for Exponential Distribution.
          Actual value is .10.

             SIDDIQUI ESTIMATOR*       LEAST SQUARES ESTIMATOR
   C         AVERAGE      MSE          AVERAGE      MSE
  .05         .100       .00043         .109       .00073
  .10         .101       .00021         .106       .00032
  .15         .096       .00014         .104       .00022
  .20         .093       .00015         .104       .00017
  .25         .086       .00027         .104       .00015
  .30         .079       .00051         .105       .00013
  .35         .065       .00125         .104       .00010
  .40         .036       .00421         .089       .00018

* Optimal C from (3.4) is .162


5. SIMULATION FOR STOPPING TIMES: Four
procedures for determining when to stop a
simulation are compared using (of all things!) a
simulation. Each procedure was repeated 5000
times.

The procedures were:

(a) An initial sample of size 1000 was selected. We
used the Siddiqui density estimator with C=.2 and
then using equation (2.1) obtained an estimate of
the sample size necessary for a 95% probability of
being within a prescribed distance of the .90
quantile. The required additional sample was then
drawn and using the combined sample the quantile
was estimated. For each repetition we record the
sample size used and whether the estimate is
within the prescribed distance of the actual .90
quantile.

(b) Same as (a) except that we used the Least
Squares density estimator with C=.35.

(c) Same as (b) except that after the second
sample is drawn we again calculate a (least
squares) estimate of the density function and,
using equation (2.1) again, if we need a larger
sample than has already been drawn we draw the
required additional observations. This is called the
3 stage least squares.

(d) Starting with an initial sample size of 1000 we
calculated a 95% confidence interval for the .90
quantile. If the half width of the interval was less
than the prescribed distance used with the other
procedures we stopped. If not we drew an
additional 200 observations and using the combined
sample determined again the confidence interval and
again compared the half width to the previously
prescribed distance. This is repeated until the half
width of the interval is less than the prescribed
distance. This follows the Dallal and Wilkinson
(1986) procedure.
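Procedure (a) can be sketched in Python as follows, assuming NumPy. The function names, the `draw` callback that returns new independent observations, and the use of `numpy.quantile` (linear interpolation) for the final sample quantile are all assumptions made for illustration; the paper does not specify these details.

```python
import math
from statistics import NormalDist
import numpy as np

def siddiqui_density(sample, p, c):
    """Siddiqui-type estimate of f(x_p) with d = c * N**(-1/5)."""
    x = np.sort(np.asarray(sample, dtype=float))
    N = x.size
    d = c * N ** (-0.2)
    m = min(int(np.floor(N * (p + d))) + 1, N)   # upper rank, clamped to N
    n = max(int(np.floor(N * (p - d))) + 1, 1)   # lower rank, clamped to 1
    return 2.0 * d / (x[m - 1] - x[n - 1])

def two_stage_quantile(draw, p, distance, c=0.2, n0=1000, prob=0.95):
    """Two-stage stopping rule: pilot sample, size from (2.1), then top up."""
    first = np.asarray(draw(n0), dtype=float)
    f_hat = siddiqui_density(first, p, c)        # stage 1: density at the quantile
    z = NormalDist().inv_cdf(0.5 + prob / 2.0)
    n_required = math.ceil(p * (1.0 - p) * z ** 2 / (distance * f_hat) ** 2)
    extra = max(n_required - n0, 0)              # stage 2: additional draws, if any
    if extra > 0:
        sample = np.concatenate([first, np.asarray(draw(extra), dtype=float)])
    else:
        sample = first
    return float(np.quantile(sample, p)), int(sample.size)
```

A call such as two_stage_quantile(lambda k: rng.exponential(size=k), 0.90, 0.120), with rng a NumPy random generator, returns the quantile estimate and the total sample size used. Procedures (b) and (c) differ only in the density estimator and in repeating the sample size calculation once more.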


ruM.i-  :ia 
TABLE 3c  Results of Stopping Time Simulations
          for Exponential Distribution

                     Proportion          Average       Standard
                     Within .120         Sample        Deviation of
Procedure            of .90 Quantile     Size          Sample Size
Siddiqui (a)            .9?1              28??            586
Least Sq (b)            .939              2280            386
Least Sq (c)            .944              2370            352
  (3 Stages)
Conf Int (d)            .943              2552            632
Least Sq (e)*           .970              3073            509

* To indicate how sensitive the results can be to
the choice of C, we also ran this with C=.4, with
considerably different results than under (b) with
C=.35.

In order to show what happens when the
procedure causes the simulation to stop with a
smaller sample size we also include Table 4. A
similar table is given in Lilliefors (1987).

This table gives the breakdown of the proportion
of the quantile estimates that are within the
prescribed .065 of the .90 quantile for the Normal
distribution according to sample size intervals (e.g.
<1250, between 1250 and 1500, etc.)

Table 4  Breakdown by sample size for Confidence
         Interval Procedure for normal distribution

                  Number       Number Within      Proportion
Sample Size       of           .065 of .90        within .065
                  Samples      Quantile           of Quantile
<1250                30            23                .77
1251-1500            38            31                .82
1501-1750           124           107                .86
1751-2000           516           468                .91
2001-2250           484           446                .92
2251-2500           620           585                .94
2501-3000          1546          1484                .96
3001-3500           853           833                .98
3501-4000           628           614                .98
>4000               161           160                .99

6. DISCUSSION OF RESULTS:

(a) First of all it should be noted that this has
been a rather limited comparison. I do have some
additional results for the .95 quantile which are
pretty much in accord with these.

(b) One comforting result is that all the
procedures seem to work reasonably well.

(c) The only real difference between the
confidence interval procedure and the new
procedures that are considered is in terms of the
standard deviation of the sample size. The
confidence interval procedure seems to have a
consistently larger standard deviation than the
procedures using the density estimators. As noted
in Lilliefors (1987) the proportion of intervals that
cover the true value of the quantile conditional on
the sample size being small may be much less than
the nominal 95%. See also Table 4 for another look
at this problem.

(d) As noted previously, it appears that the least
squares estimator is better than the Siddiqui
estimator. It is generally less sensitive to the value
of C (which determines the interval width) and
provides an estimate with a smaller standard
deviation.
SIMULATING STATIONARY GAUSSIAN ARMA TIME SERIES

Terry  J.  Woodfield,  SAS  Institute  Inc. 

Box  8000,  SAS  Circle,  Cary,  NC  27512-8000 


1.  INTRODUCTION 

Many instructors and researchers often find that the simulation
of time series data is a necessary part of their work. The
proliferation of textbooks that describe the Box-Jenkins strategy for
modeling time series has popularized the use of time series models
having a stationary autoregressive moving average (ARMA)
structure. Accurate and efficient simulation algorithms for Gaussian
ARMA processes are required for many applications.

The literature on simulation of stochastic data is vast, but
specific articles focusing on time series simulation are rare. The
algorithms discussed in this paper have been extracted from a variety
of secondary sources. Primary sources are scarce, perhaps because
the algorithms are straightforward and easily derived and hence
not suitable for publication in scholarly journals.

There  are  three  components  of  a  time  series  generator. 

1.  Algorithm  to  generate  pseudo  random  numbers. 

2.  Algorithm  to  convert  pseudo  random  numbers  to  pseudo 
random  normal  deviates. 

3.  Algorithm  to  convert  pseudo  random  normal  deviates  to  a 
time  series. 

Efficient and practical solutions to components 2 and 3 have
existed for some time. Component 1 has been investigated
extensively, but choice of an optimal pseudo random number generator
is still an open question possibly having no unique solution. The
current practice seems to be to declare a random number generator
to be adequate unless it can be shown to have poor properties.
Thus, our research is motivated by the concern that a pseudo
random number generator that has passed existing tests may fail to
produce reasonable time series data. Intuitively, pseudo random
number generators that have good n-space uniformity should
produce white noise sequences that are adequate for generating ARMA
time series. This paper provides some preliminary results that
support this conjecture.

2.  THE  MODEL 

The univariate ARMA model for a stationary time series is

    \phi(B) Y_t = \theta(B) \epsilon_t      (1)

where

1. B is the backshift operator defined by B Y_t = Y_{t-1}.

2. \phi(B) = \phi_0 + \phi_1 B + \phi_2 B^2 + ... + \phi_p B^p,
   \theta(B) = \theta_0 + \theta_1 B + \theta_2 B^2 + ... + \theta_q B^q,
   where \phi_0 = \theta_0 = 1.

3. {\epsilon_t} is a white noise series, i.e., independently and
   normally distributed with mean zero and variance \sigma^2 > 0.

4. The zeroes of \phi(B) and \theta(B) lie outside the unit circle,
   and \phi(B) and \theta(B) have no common zeroes.

The {\epsilon_t} series is referred to as the error sequence or the
innovation sequence. The notation and terminology employed is primarily
that of Box and Jenkins (1976). However, note that the signs of
the model coefficients are opposite those given by Box and Jenkins.

The methods described will generate a series with E(Y_t) = 0
and innovation variance \sigma^2 = 1. To obtain a time series W_t
with specified mean \mu and error standard deviation \sigma > 0, use the
transformation W_t = \sigma Y_t + \mu. The variance of W_t will be \sigma^2 times
the variance of Y_t. For specified variance \sigma_W^2, use transformation
W_t = k Y_t + \mu, where k = \sigma_W / \sigma_Y.

All methods may use the following recursion to generate
Y_p, Y_{p+1}, ..., Y_{n-1}.

    Y_t = - \sum_{i=1}^{p} \phi_i Y_{t-i} + \epsilon_t + \sum_{i=1}^{q} \theta_i \epsilon_{t-i},      t = p, p+1, ..., n-1.      (2)

We may treat the methods as differing only in how they produce
starting values Y_0, Y_1, ..., Y_{p-1}. Note that for the two exact
methods, this recursion is described in the context of the method
employed and is somewhat different than is given in equation (2).
There are algorithms that do not depend on the ARMA recursion
relationship; for example, an efficient algorithm exists that uses
the Kalman Filter. Also, some applications have employed the
algorithms discussed to generate the entire series and not just the
starting values.
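A minimal Python sketch of recursion (2) is given below, assuming NumPy. The function name and argument layout are illustrative; the coefficients follow this paper's sign convention, \phi(B) Y_t = \theta(B) \epsilon_t with \phi_0 = \theta_0 = 1, so a Box-Jenkins AR coefficient of 0.5 corresponds to \phi_1 = -0.5 here.

```python
import numpy as np

def arma_recursion(phi, theta, y_start, eps):
    """Extend starting values Y_0..Y_{p-1} with recursion (2).

    phi   : [phi_1, ..., phi_p]
    theta : [theta_1, ..., theta_q]
    eps   : innovations eps_{p-q}, ..., eps_{n-1} (length n - p + q)
    """
    phi = np.asarray(phi, dtype=float)
    theta = np.asarray(theta, dtype=float)
    eps = np.asarray(eps, dtype=float)
    p, q = len(phi), len(theta)
    n = p + len(eps) - q
    y = np.empty(n)
    y[:p] = y_start
    for t in range(p, n):
        e = eps[t - p + q - np.arange(q + 1)]    # eps_t, eps_{t-1}, ..., eps_{t-q}
        past = y[t - 1 - np.arange(p)]           # Y_{t-1}, ..., Y_{t-p}
        y[t] = -(phi @ past) + e[0] + theta @ e[1:]
    return y
```

For example, with phi = [-0.5], no MA terms, a starting value of 1 and zero innovations, the series decays as 1, 0.5, 0.25, and so on.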

Typically, all algorithms will have a branch such that if p = 0,
then the simple moving average recursion is employed on a white
noise sequence of length n + q. The algorithm is


1. Generate \epsilon_{-q}, \epsilon_{-q+1}, ..., \epsilon_{n-1} using an appropriate
   random normal generator such as the one proposed by
   Marsaglia as described in Kennedy and Gentle (1980).

2. Form

       Y_t = \sum_{k=0}^{q} \theta_k \epsilon_{t-k},      t = 0, 1, ..., n-1.
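The pure moving average branch can be sketched in Python as follows, assuming NumPy. The function name is illustrative, and NumPy's `standard_normal` is used here only as a stand-in for the Marsaglia generator cited above.

```python
import numpy as np

def simulate_ma(theta, n, rng):
    """Generate Y_0..Y_{n-1} from Y_t = sum_{k=0}^{q} theta_k eps_{t-k}, theta_0 = 1.

    theta : [theta_1, ..., theta_q]
    Returns the series and the innovations eps_{-q}, ..., eps_{n-1}.
    """
    theta_full = np.concatenate(([1.0], np.asarray(theta, dtype=float)))
    q = len(theta_full) - 1
    eps = rng.standard_normal(n + q)             # white noise of length n + q
    # 'valid' convolution keeps exactly the n terms that have all q lagged innovations
    y = np.convolve(eps, theta_full, mode="valid")
    return y, eps
```

Because no autoregressive terms are present, no starting values are needed and the generated series is exactly stationary.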


3. SIMULATION ALGORITHMS

In this work, four methods are discussed for simulating an
ARMA(p,q) process.


3.1 Approximate Methods

The simplest approximation method uses starting values
Y_{-p} = Y_{-p+1} = ... = Y_{-1} = 0 and employs the recursions of equation (2)
to generate the series realization. Since the first few values in the
simulated series will be affected by the null starting values, the
point at which the effect of these values is minimal will be used as
the starting point for the series.

The effect of starting values Y_{-k}, k = 1, 2, ..., p, can be
monitored using the following algorithm. Let \Phi_{t,k} be the coefficient
representing the effect of Y_{-k} on future value Y_t, and let \Theta_{t,k} be
the coefficient representing the effect of \epsilon_k on Y_t. The
relationship between Y_t, the starting values, and the innovations is given
by

    Y_t = \sum_{k=1}^{p} \Phi_{t,k} Y_{-k} + \sum_{k=-q}^{t} \Theta_{t,k} \epsilon_k      (3)

The weights \Phi_{t,k} and \Theta_{t,k} are obtained using the following
recursions.

    \Phi_{0,k} = -\phi_k,      k = 1, 2, ..., p
    \Theta_{0,-k} = \theta_k,      k = 0, 1, 2, ..., q
    \Theta_{t,k} = 0,      k > t

    \Phi_{t,k} = -\phi_{t+k} - \sum_{i=1}^{t} \phi_i \Phi_{t-i,k},      t = 1, 2, ...

    \Theta_{t,k} = \theta_{t-k} - \sum_{i=1}^{t} \phi_i \Theta_{t-i,k},      t = 1, 2, ...
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where  any  coefficients  with  subscripts  out  of  bounds  are  taken  to 
be  zero.  For  example,  consider  i!ie  model 

(1  -  OAB  -  =  (1  -  0 


The  ^  matrix  is  given  by 


$  = 


/0. 40000 
0.4SOOO 
0.32000 
0.28160 
\0. 21804 


0.320000  \ 
0.128000 
0.153600 
0.102100 
0.090112/ 


1.  Obtain  F  =  (Cot'(y,,  Vj ))  -  (7^;),  0  <  i,j  <  1  For 

a  method  to  compute  the  autocovariance  function  of  an 
ARMA(p.  g)  process,  see  McLeod  (1975,1977). 

2.  Form  the  Cholesky  root  of  F,  that  is,  find  H  such  that 
F  —  HH*,  where  H  is  a  lower  triangular  matrix. 

3.  Generate  ^0.  ^p-ii  using  an  appropriate  random 

normal  generator. 

4.  Form  Y  =  (Vo.  .  •  •  •  t  i'p- 1 )'  using  Y  =  He,  where  e  - 

5.  Generate  Cp-^+i,. . . ,  fn-i.  using  an  appropriate  ran¬ 
dom  normal  generator. 

6.  Generate  V'p,  Kp  +  i , . . . ,  1  using  recursion  equation  (2) 

above. 


where  rows  are  numbered  t  =  0,1,2,  3,4,  and  columns  are  num¬ 
bered  it  =  1,  2.  The  0  matrix  is  given  by 


/  -0.10000 

-0.30000 

i.oooo 

0.000 

0.00 

0.0 

0.0\ 

-0.04000 

-0.22000 

0.1000 

1.000 

o.on 

0.0 

0.0 

©  =: 

-0.04800 

-0.18400 

0.2600 

0.100 

1.00 

0.0 

0.0 

-0.03200 

-0.14400 

0.1360 

0.260 

0.10 

1.0 

0.0 

\  -0.02816 

-0.11648 

0.13T6 

0.136 

0.26 

0.1 

1.0/ 

where 

rows  arc 

numbered  t 

=  0, 

1,  2,3,4. 

and 

columns  arc 

numbered  k  = 

-2,  -1,0, 1, 

2,3,  4. 

Hence, 

the 

series 

Y 

= 

( Yo,  Yi , . . .  ,  Yn)'  may  be  formed  using 

Y  =  Oe  -i-  ^ly , 

where  c  =  (€_,, fn)')  ^tnd  y  =  (I'-i ,  Y.2)'- 

Note  that  the  last  row  of  0  converges  to  the  infinite  MA  rep¬ 
resentation  of  the  model.  Wlien  the  la.st  row  of  ^  is  negligible, 
then  one  may  assume  that  the  effect  of  the  starting  values  has  van¬ 
ished  and  that  steady  state  has  been  reached.  However,  when  start¬ 
ing  values  of  2ero  are  employed,  steady  state  is  not  reached  until 
the  last  row  of  the  0  matrix  closely  matches  the  infinite  MA  repre¬ 
sentation  of  the  model.  In  practice,  the  appro.xiniation  methods  are 
not  very  competitive  because  of  the  computational  burden  of  de¬ 
termining  in  order  to  determine  where  the  actual  series  that  is 
generated  is  to  begin. 

A more convenient approach than computing the $\Theta$
matrix is to obtain starting values using a truncated infinite MA
representation for the ARMA(p, q) model. Let $Y_t = \psi(B)\epsilon_t$,
where

$$ \psi(B) = 1 + \psi_1 B + \psi_2 B^2 + \cdots = \sum_{i=0}^{\infty} \psi_i B^i. \qquad (3) $$

The  algorithm  may  be  implemented  as  follows. 

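1. Find k such that $|\psi_k| > TOL$ and $\psi_{k+1}, \psi_{k+2}, \ldots, \psi_{k+p+q}$
are all less than TOL in absolute value for some specified
tolerance TOL.

2. Generate $\epsilon_{-k}, \ldots, \epsilon_{-1}, \epsilon_0, \ldots, \epsilon_{n-1}$ using an
appropriate random normal generator.

3. Form $Y_t = \epsilon_t + \psi_1\epsilon_{t-1} + \psi_2\epsilon_{t-2} + \cdots + \psi_k\epsilon_{t-k}$ for
t = 0, 1, ..., p - 1.

4. Generate $Y_p, Y_{p+1}, \ldots, Y_{n-1}$ using recursion equation (2)
above.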

This algorithm for generating a time series will be referred to
in this paper as the Psi Weight Method. Note that using starting
values of zero is inferior to using starting values generated by the
Psi Weight Method.
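
The four steps can be sketched in Python as follows. This is a minimal
illustration, not the authors' implementation. Recursion (2) is not
reproduced in this part of the paper, so the code assumes the form
$Y_t = \sum_i \phi_i Y_{t-i} + \epsilon_t - \sum_j \theta_j\epsilon_{t-j}$ with unit innovation
variance; the function names and the `max_lag` safeguard are hypothetical.

```python
import random


def psi_weights(phi, theta, count):
    """psi_0, ..., psi_{count-1} for the assumed ARMA recursion."""
    psi = []
    for j in range(count):
        val = 1.0 if j == 0 else 0.0
        if 1 <= j <= len(theta):
            val -= theta[j - 1]
        for i in range(1, min(j, len(phi)) + 1):
            val += phi[i - 1] * psi[j - i]
        psi.append(val)
    return psi


def truncation_lag(phi, theta, tol, max_lag=10000):
    """Step 1: first k with |psi_k| > tol and the next p+q weights below tol."""
    run = max(len(phi) + len(theta), 1)
    psi = psi_weights(phi, theta, max_lag)
    for k in range(max_lag - run):
        if abs(psi[k]) > tol and all(abs(w) < tol for w in psi[k + 1:k + 1 + run]):
            return k
    raise ValueError("psi weights did not fall below tol within max_lag")


def psi_weight_series(phi, theta, n, tol=1e-6, rng=None):
    """Steps 2 to 4: truncated MA start-up values, then the ARMA recursion."""
    rng = rng or random.Random()
    p, q = len(phi), len(theta)
    k = truncation_lag(phi, theta, tol)
    psi = psi_weights(phi, theta, k + 1)
    pre = max(k, q)                                       # presample noise needed
    eps = [rng.gauss(0.0, 1.0) for _ in range(pre + n)]   # eps[pre + t] is eps_t
    y = []
    for t in range(n):
        if t < p:   # step 3: truncated infinite MA representation
            y.append(sum(psi[j] * eps[pre + t - j] for j in range(k + 1)))
        else:       # step 4: recursion on earlier values and innovations
            ar = sum(phi[i] * y[t - 1 - i] for i in range(p))
            ma = sum(theta[j] * eps[pre + t - 1 - j] for j in range(q))
            y.append(ar + eps[pre + t] - ma)
    return y
```

For example, `psi_weight_series([0.4, 0.32], [0.3, 0.1], 200, rng=random.Random(1))`
returns a reproducible series of length 200 for one illustrative
ARMA(2,2) parameter choice.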

The Psi Weight Method is a useful "quick and dirty" algorithm.
It may be programmed quickly in a matrix language or a lower level
computer language. It requires no laborious calculations and can
be speeded up using a fast finite Fourier transform algorithm.

Another approach uses a linear transformation of p white noise
values based on the autocovariance function of an ARMA(p, q)
process. The method is implemented as follows.


This algorithm will be called the Approximate Autocovariance
Method in this work.

Note that an equivalent but computationally more intensive
method can be based on steps 1 through 4 using an n
by n covariance matrix rather than a p by p covariance matrix.
The method using the full n by n covariance matrix is exact.
Otherwise, this method is not exact because the error sequence
$e = (\epsilon_0, \epsilon_1, \ldots, \epsilon_{p-1})'$ is independent of the innovation
sequence $(\epsilon_{p-q+1}, \ldots, \epsilon_{n-1})$. The innovations used to compute
$Y_{p+1}, Y_{p+2}, \ldots, Y_{p+q}$ do not take into account the covariance
$E(Y_{t+k}\epsilon_t)$ between the innovations and the time series. To get an
exact ARMA(p, q) realization, the first q innovations must be
generated having the appropriate covariance structure with the time
series. A method that accomplishes this task is described below.
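
The full n by n variant mentioned above can be sketched as follows. To
keep the example short it is restricted to a pure MA(q) process, whose
autocovariances have a simple closed form; for a general ARMA(p, q)
process the autocovariances would have to come from an algorithm such
as the one in McLeod (1975). The sign convention
$Y_t = \epsilon_t - \sum_j \theta_j\epsilon_{t-j}$ and all function names are assumptions
made for this illustration.

```python
import math
import random


def ma_autocovariances(theta, n, sigma2=1.0):
    """gamma_0, ..., gamma_{n-1} for Y_t = eps_t - sum_j theta_j eps_{t-j}."""
    coef = [1.0] + [-t for t in theta]
    q = len(theta)
    return [sigma2 * sum(coef[i] * coef[i + k] for i in range(q + 1 - k))
            if k <= q else 0.0
            for k in range(n)]


def cholesky_lower(a):
    """Lower triangular H with H H' = a, for a symmetric positive definite a."""
    n = len(a)
    h = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(h[i][k] * h[j][k] for k in range(j))
            h[i][j] = math.sqrt(s) if i == j else s / h[j][j]
    return h


def exact_series(gamma, rng=None):
    """Y = H e, where H is the Cholesky root of the n by n Toeplitz covariance."""
    rng = rng or random.Random()
    n = len(gamma)
    cov = [[gamma[abs(i - j)] for j in range(n)] for i in range(n)]
    h = cholesky_lower(cov)
    e = [rng.gauss(0.0, 1.0) for _ in range(n)]
    return [sum(h[i][j] * e[j] for j in range(i + 1)) for i in range(n)]
```

The cost of the factorization grows roughly with the cube of n, which is
the computational burden that motivates working with a p by p matrix
and a recursion instead.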

3.2  Exact  Methods 

An exact finite realization of an ARMA(p, q) process can be
obtained using an enhanced version of the linear transformation
algorithm employed above. The method is implemented as follows.

1. Obtain $\Gamma_0 = (\mathrm{Cov}(Y_i, Y_j)) = (\gamma_{i-j})$, $0 \le i, j \le p - 1$.

2. Let $m = \max(1, q - p + 1)$, and obtain psi weights
$\psi_1, \psi_2, \ldots, \psi_{q-m}$ given by equation (3) above. Form the
matrix


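$$
\Psi = \begin{pmatrix}
\psi_0 & 0 & 0 & \cdots & 0 \\
\psi_1 & \psi_0 & 0 & \cdots & 0 \\
\psi_2 & \psi_1 & \psi_0 & \cdots & 0 \\
\vdots & \vdots & \vdots & & \vdots \\
\psi_{q-m} & \psi_{q-m-1} & \psi_{q-m-2} & \cdots & \eta
\end{pmatrix}
$$

where $\psi_0 = 1$ and $\eta = 1$ if $m = 1$, $\eta = 0$ otherwise. If
$p > q$, place $p - q$ rows of zeroes at the beginning of $\Psi$ so
that

$$
\Psi = \begin{pmatrix}
0 & 0 & 0 & \cdots & 0 \\
\vdots & \vdots & \vdots & & \vdots \\
0 & 0 & 0 & \cdots & 0 \\
\psi_0 & 0 & 0 & \cdots & 0 \\
\psi_1 & \psi_0 & 0 & \cdots & 0 \\
\vdots & \vdots & \vdots & & \vdots \\
\psi_{q-m} & \psi_{q-m-1} & \psi_{q-m-2} & \cdots & \eta
\end{pmatrix}.
$$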

3. Form the covariance matrix


where $I_q$ is the q by q identity matrix.

4. Form the Cholesky root $H$ of $\Gamma_1$.

5. Generate $\epsilon_{-q}, \ldots, \epsilon_{p-1}$ using an appropriate
random normal generator.
6. Form $S = (S_{-q}, S_{-q+1}, \ldots, S_{p-1})'$ using $S = He$, where
$e = (\epsilon_{-q}, \epsilon_{-q+1}, \ldots, \epsilon_{p-1})'$. Let

7.  Form 

9 

A't  =  +  (  =  0.  1.  .  .  .  ,  n  *  ;)  -  1. 

*  =  0 

8.  Form 

>■(  -  5t_,,  (  =  0,  1 . p  I, 

p 

y,  =  Xt-p  -  t  =  p, /)  +  1,  - . . ,  a  -  1. 

k  =  l 


Thus, the values $Y_0, \ldots, Y_{p-1}$ come directly from the
transformation using the Cholesky factorization, $Y_p, Y_{p+1}, \ldots, Y_{p+q-1}$
come from the recursion relation and the Cholesky factorization of
the expanded covariance matrix, and $Y_{p+q}, Y_{p+q+1}, \ldots, Y_{n-1}$ come
from the recursion relation. In this paper, the above algorithm is
called the Exact Autocovariance Method. The Exact Autocovariance
Method is probably the most popular simulation technique
used in practice. For example, Ansley and Newbold (1980)
describe use of the algorithm in an appendix, and an example of
the algorithm programmed in SAS/IML® by David M. DeLong
is given in the SAS/IML User's Guide (1985a).

The last method to be considered is described in a homework
exercise in Brockwell and Davis (1987, page 284, problem 8.17) and
is based on the Innovations Algorithm used in finite memory
prediction. The method is related to the linear prediction approach
suggested by Wilson (1978). The method is implemented as
follows.

1. Generate $\epsilon_0, \epsilon_1, \ldots, \epsilon_{n-1}$ using an appropriate random
normal generator.

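2. Let $m = \max(p, q)$. Form

$$ W_t = \sigma_t\epsilon_t + \sum_{i=1}^{t} \theta_{t,i}\,\sigma_{t-i}\,\epsilon_{t-i}, \quad t \le m - 1, $$

$$ W_t = \sigma_t\epsilon_t + \sum_{i=1}^{q} \theta_{t,i}\,\sigma_{t-i}\,\epsilon_{t-i}, \quad t \ge m, $$

where the coefficients $\theta_{t,i}$ and $\sigma_t$ are obtained using the
recursion

$$ \sigma_0^2 = \Gamma(1, 1), $$

$$ \theta_{n,n-k} = \Big(\Gamma(n+1, k+1) - \sum_{i=0}^{k-1} \theta_{k,k-i}\,\theta_{n,n-i}\,\sigma_i^2\Big)\Big/\sigma_k^2, \quad k = 0, 1, \ldots, n - 1, $$

$$ \sigma_n^2 = \Gamma(n+1, n+1) - \sum_{i=0}^{n-1} \theta_{n,n-i}^2\,\sigma_i^2, $$

with the covariance function $\Gamma(i, j)$ defined by

$$
\Gamma(i,j) = \begin{cases}
\sigma^{-2}\gamma_{i-j}, & 1 \le i, j \le m, \\
\sigma^{-2}\big[\gamma_{i-j} - \sum_{k=1}^{p} \phi_k \gamma_{k-|i-j|}\big], & \min(i,j) \le m < \max(i,j) \le 2m, \\
\sum_{k=0}^{q} \theta_k \theta_{k+|i-j|}, & \min(i,j) > m, \\
0 & \text{otherwise.}
\end{cases}
$$

The values $\{\gamma_k\}$ used to compute $\Gamma(i, j)$ are the
autocovariances for the original ARMA(p, q) process, and $\sigma^2$ is
the error variance.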


3. Form $Y_0, Y_1, \ldots, Y_{n-1}$ using

$$ Y_t = \sigma W_t, \quad t < m, $$

$$ Y_t = \sigma W_t + \sum_{k=1}^{p} \phi_k Y_{t-k}, \quad t \ge m. $$


4. RANDOM NUMBER GENERATION

The algorithms discussed above employ an independent normal
random deviate generator. The algorithm of Marsaglia as
described in Kennedy and Gentle (1980) has been mentioned as a
suitable normal deviate generator. Since Marsaglia's algorithm is
efficient and provides independent deviates that have an exact normal
distribution, we will not consider competing exact or approximate
methods. Instead, we will focus on pseudo random number
generators that may be potential candidates for use with Marsaglia's
normal deviate generator.

A linear congruential pseudo random number generator employs
the recursion

$$ X_n = mX_{n-1} + c \pmod{M}. $$

When c = 0, the generator is called a multiplicative congruential
pseudo random number generator. The constant m is called
the multiplier, and M is called the modulus. The choice of
multiplier m determines the properties of the generator. Fishman and
Moore (1982, 1986) evaluate the generator for the most commonly
suggested multipliers.
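
A minimal Python sketch of the multiplicative case (c = 0) is given
below. The multiplier 16807 with modulus $2^{31} - 1$ is used purely as a
familiar illustrative choice; the function names are hypothetical.
Python integers do not overflow, so the product can be formed directly,
whereas an implementation on a 32 bit machine would need extra care.

```python
def multiplicative_congruential(seed, multiplier=16807, modulus=2**31 - 1):
    """Yield X_n = multiplier * X_{n-1} mod modulus, starting from X_0 = seed."""
    x = seed
    while True:
        x = (multiplier * x) % modulus
        yield x


def uniforms(seed, count, multiplier=16807, modulus=2**31 - 1):
    """Scale the integer stream to (0, 1) by dividing by the modulus."""
    gen = multiplicative_congruential(seed, multiplier, modulus)
    return [next(gen) / modulus for _ in range(count)]


stream = multiplicative_congruential(1)
print([next(stream) for _ in range(3)])   # first three integers from seed 1
```

The starting value X_0 must lie between 1 and M - 1; a seed of zero would
produce a stream of zeros.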

The  Tausworthe  pseudo  random  number  generator  is  a  special 
case  of  the  Generalized  Feedback  Shift  Register  (GFSR)  algorithm 
(Lewis  and  Payne  1973).  Kennedy  and  Gentle  (1980)  provide  com¬ 
puter  code  for  a  simple  Tausworthe  generator  that  uses  the  prinii- 
Uve  polynomial  ■=  f  x“  -4  1. 

For details on pseudo random number generators and how they
may be tested, see Kennedy and Gentle (1980) or Knuth (1981).
Note that for any pseudo random number generator, a starting
value $X_0$ is required. For some generators, choice of starting value
has an effect on the properties of the pseudo random sequence
produced. A number of theoretical and empirical tests exist to
evaluate random number generators. We will evaluate several
generators with respect to the quality of ARMA time series produced
using the generators.

5.  EVALUATING  THE  ALGORITHMS 

For algorithms that depend on recursion (2), the sequences
produced will converge to a common time series if a common error
sequence is used in the recursion. Using early values before the time
series has reached steady state is not recommended. Assuming
that any implementation of an ARMA time series generator will
ensure that only steady state values are used, any other
comparison of the algorithms should be based only on numerical properties
related to their implementation on finite precision digital
computers.

The behavior of any given time series generator ultimately
depends on the choice of uniform random number generator to be
employed. Hence, statistical evaluation of time series generators will
be carried out using a designed experiment involving type of
uniform generator as a primary factor of interest. Given the
equivalence of the algorithms after steady state is reached, the experiment
described in the next section will employ only the Exact
Autocovariance Method.

To measure the quality of a generated time series, goodness-of-fit
criteria are developed. One approach to measuring
goodness-of-fit makes comparisons between the sample autocovariances and
the true autocovariances. Two measures of closeness are
$$ MSE = \frac{1}{m} \sum_{k=0}^{m-1} (\hat{\gamma}_k - \gamma_k)^2, $$

$$ MAD = \frac{1}{m} \sum_{k=0}^{m-1} |\hat{\gamma}_k - \gamma_k|, $$


where $\gamma_k$ is the theoretical autocovariance function at lag k, $\hat{\gamma}_k$ is
the sample autocovariance function at lag k, and m is chosen so
that values of the true autocovariance sequence beyond lag m are
relatively small. Intuitively, the sample autocovariances will
converge to the true autocovariance function, so the measures MSE
and MAD will reflect whether the series generated comes from the
specified model. As series length increases, MSE and MAD should
get smaller. Hence, the use of MSE and MAD is more
meaningful for evaluating the quality of long generated series. If MSE and
MAD do not approximate a monotone decreasing sequence for
increasing n, then the simulation algorithm employed is
unacceptable.
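
The two measures are straightforward to compute. The sketch below is an
illustration only: the sample autocovariance uses the mean-corrected
estimator with divisor n, which is an assumption since the text says
only that the estimator is biased, and the function names are
hypothetical. As a check, the pure MA model 2 of the experiment,
$Y_t = (1 + 1.25B + 0.35B^2)\epsilon_t$, has $\gamma_0 = 2.685$,
$\gamma_1 = 1.6875$ and $\gamma_2 = 0.35$; with m = 10 and all sample
autocovariances set to zero this gives the summary values 1.01794 and
0.47225 listed for that model in Table 1.

```python
def sample_autocovariances(y, m):
    """Biased sample autocovariances at lags 0, ..., m-1 (divisor n)."""
    n = len(y)
    mean = sum(y) / n
    d = [v - mean for v in y]
    return [sum(d[t] * d[t + k] for t in range(n - k)) / n for k in range(m)]


def mse_mad(gamma_true, gamma_hat):
    """Mean squared and mean absolute difference over lags 0, ..., m-1."""
    m = len(gamma_true)
    diffs = [gh - g for g, gh in zip(gamma_true, gamma_hat)]
    mse = sum(x * x for x in diffs) / m
    mad = sum(abs(x) for x in diffs) / m
    return mse, mad


# Model 2 with truncation lag m = 10 and sample autocovariances set to zero.
gamma_true = [2.685, 1.6875, 0.35] + [0.0] * 7
mse, mad = mse_mad(gamma_true, [0.0] * 10)
print(round(mse, 5), round(mad, 5))
```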

One problem with the autocovariance approach to measuring
goodness-of-fit is that the values MSE and MAD may not
provide a true measure of closeness for short series because the
autocovariance estimator is biased. Thus, it is conceivable that a
generated series may be better than another series in some sense,
but may produce a biased estimate of the autocovariance function
that is worse than the biased estimate produced by the other
sequence. While it is unlikely that the bias will uniformly favor one
method over another even when that method is inferior by most
criteria one might devise, one should nonetheless be made aware of
the potential pitfalls in this method of comparison. If significant
differences are noted, one explanation is that the better method
produces sequences that minimize bias in estimation.

Note also that for fixed series length n, MSE or MAD may
favor a series that is generated to have smaller innovation variance
than another series. This is primarily because MSE or MAD
may be small when [...] is close to zero. On the other hand, MSE
and MAD values will tend to remain constant in such cases across
all sample sizes, whereas series generated using the correct
innovation variance will experience MSE and MAD values that get
smaller with increased series length. This fact also negates the
potential negative effect of bias in measuring goodness-of-fit, since a
faulty generator is unlikely to produce time series with biased
autocovariance estimates that uniformly beat the estimates produced
by a better generator across all sample sizes. Recall that for large
sample sizes, the bias factor tends to be relatively small.

Other approaches exist for measuring goodness-of-fit of
generated time series. Most approaches will have the same pitfalls as
those discussed above. The numerical overhead required in
implementing many goodness-of-fit measures places a severe penalty on
their use in large scale simulation studies.


6. A MONTE CARLO EXPERIMENT

An experiment has been performed to investigate properties of
time series generators. The factor of interest in the experiment is
the type of uniform generator employed. There are six levels for
factor 1 corresponding to six algorithms for generating pseudo
random numbers. The six pseudo random number generators
considered are

1. multiplicative congruential generator with m = [...]

2. multiplicative congruential generator with m = 16807

3. multiplicative congruential generator with m = [...]

4. multiplicative congruential generator with m = 742938285

5. multiplicative congruential generator with m = [...]

6. Tausworthe generator (Kennedy and Gentle, p. [...])

The modulus for generators (1) through (5) is $M = 2^{31} - 1$.
Hence, these generators are appropriate for computers having 32
bit words. Note that generators (1) through (4) have been
investigated by Fishman and Moore (1982, 1986). Generator (1) is the
notorious RANDU generator. Generators (2) and (3) are
available in IMSL® (1987). Generator (3) is also available in SAS
(1985b). Fishman and Moore (1986) suggest that generator (4) is
superior to generators (1) through (3). Generator (5) is known to
have poor runs properties and is included as a control. If
generator (5) cannot be judged significantly worse than the other
generators, then one should look for flaws in the Monte Carlo experiment
and at the very least view the results with caution.

In addition, two factors, sample size and ARMA model
employed, are required to attempt to generalize the results to a wide
variety of situations likely to be encountered in practice. The
sample sizes considered are n = 50, 100, and 500. The models
employed are:

1. Ansley and Newbold (1981)

$(1 - 0.80B + 0.65B^2)Y_t = \epsilon_t$

2. Ansley and Newbold (1981)

$Y_t = (1 + 1.25B + 0.35B^2)\epsilon_t$

3. Ansley and Newbold (1981)

$(1 - 0.95B)Y_t = (1 + 0.85B)\epsilon_t$

4. Woodward and Gray (1981)

$(1 - 1.5B + 1.21B^2 - 0.455B^3)Y_t = (1 + 0.2B + 0.9B^2)\epsilon_t$

5.  Brockwell  and  Davis  (1987) 

(1  -  ZJ  -  U  =  (I  +  0.4B  ~  0.2/}^  - 

6. Newton and Pagano (1983)

(1  -  0-33577?  0.0821  7?-  -  0  1-5707?’  -t-  0.25677?'')^; 

=  (1  -  0.6077/?  T  0  f  0  100:iB  =  )f, 


Ten replications were performed for each factor level
combination. In all cases, seeds were transmitted from one routine to the
next with no attempt to control seed values or synchronize the
series generated. The response variables are MSE and MAD defined
above for the autocovariance function. The value m used to
truncate the autocovariance sequence was chosen so that $|\gamma_k| < 0.00001$
for all k > m, except for the pure MA model 2, in which case m
was arbitrarily set to ten. Table 1 lists the values of m
(truncation lag value) and values of MSE (mean sum of squares) and
MAD (mean absolute value) with $\hat{\gamma}_k$ equal to zero for all k.

Tables 2 and 3 provide listings of the generators that produced
the lowest or highest cell mean for the given sample size by model
combination. While the results for the lowest cell mean do not
appear to be randomly dispersed in the table, the table summarizing
the number of times each generator had lowest cell mean provides
values that will not lead to rejection of a null hypothesis that the
lowest cell mean is distributed uniformly across generators.

More insight is obtained from table 3. The results clearly reject
uniformity and strongly imply that generator 5 is inadequate.
Generators 4 and 6 also have an unusually high number of counts,
although the experiment is too small to allow one to draw any strong
conclusions.

Table 4 also provides strong evidence that generator 5 is
inadequate given that it fails in several situations to adequately
generate series that exhibit the monotone decreasing behavior of the
response variables. The non-starred items in table 4 represent cases
where monotonicity was violated, but only for sample sizes of 50
and 100. The non-starred items also reflect small increases that
are probably a result of sampling error rather than generator
deficiencies.
Since the Monte Carlo results may not satisfy the
assumptions to carry out the usual parametric ANOVA, a nonparametric
ANOVA was performed based on replacing response values by their
Blom normal scores. Initially, all F tests exhibited p-values smaller
than 0.01. Examination of the cell means revealed that model 3,
having roots near the unit circle, produced MSE or MAD
values considerably higher than those for other models. As indicated
above, generator 5 also consistently produced unusually high
values for most models. When model 3 and generator 5 were deleted
from the study, the model by generator and model by sample size
interactions were significant at the five percent level. However,
there  is  not  enough  evidence  to  reject  any  of  the  remaining  gener¬ 
ators  as  being  inadequate.  If  all  generators  are  basically  of  equal 
quality,  it  is  not  surprising  that  a  statistically  significant  model  by 
generator  interaction  is  observed. 

Finally,  since  the  response  variables  MSE  and  MAD  are  more 
appropriate  for  larger  sample  sizes,  we  carried  out  the  ANOVA  for 
the  case  n  =  500.  Both  response  variables  lead  to  the  conclusion 
that  there  is  a  statistically  significant  interaction  between  model 
and  generator.  In  the  presence  of  interaction  we  can  only  draw 
conclusions  about  the  effect  of  the  generators  for  the  particular 
models  in  the  study.  There  is  no  compelling  evidence  to  imply  that 
any  of  the  five  remaining  generators  may  be  consistently  superior 
or  inferior  to  the  others. 

Most results were consistent whether MSE or MAD was used
as a criterion measure. Any discrepancies may have been due to
MSE being more sensitive to outliers than MAD. Results also
supported the consistency of the sample autocovariances. In this
regard, only generator (5) can be declared unacceptable by this
analysis. While the RANDU generator is generally considered to be
inadequate, it was not rejected in our study. This is not surprising
because RANDU has been found to be adequate for many specific
applications that are not adversely affected by RANDU's poor
n-space uniformity properties. Further research is warranted using a
variety of criterion measures and a larger experimental design.

7.  CONCLUDING  REMARKS 

When choosing an algorithm for simulating a stationary
Gaussian ARMA time series, theoretical considerations narrow the
choice to efficient exact algorithms, although the Psi Weight
Method is ideal for providing a quick method for generating time
series in almost any computing environment. Choice of pseudo
random number generator becomes [...] designing a
simulation routine. A Monte Carlo study provides evidence to
indicate that some of the more popular multiplicative congruential
generators may be adequate for simulating time series data.
Killam (1987) indicates that the non-statistical tests used by
Fishman and Moore (1986) may not be as meaningful for generators
employed in many statistical applications. Our preliminary work
supports this view.

For  future  study,  known  properties  of  statistical  estimators 
should  be  investigated  with  data  simulated  using  the  algorithms 
discussed  in  this  paper.  A  larger  study  employing  more  genera¬ 
tors  is  warranted.  Only  when  expected  behavior  is  observed  to 
within  an  acceptable  tolerance  should  the  algorithms  then  be  used 
to gain insight into statistical procedures that do not have
adequate theoretical underpinnings.


Simulations described in this paper were carried out on an
Apollo workstation using the SAS System and SAS/IML software,
Version 6.03.

SAS and SAS/IML are registered trademarks of SAS Institute
Inc. IMSL is a registered trademark of IMSL Inc.


REFERENCES 

Ansley, Craig F., and Newbold, Paul (1980). Finite sample
properties of estimators for autoregressive moving average models.
Journal of Econometrics, 13, 159-183.

Ansley, Craig F., and Newbold, Paul (1981). On the Bias in
Estimates of Forecast Mean Square Error. Journal of the
American Statistical Association, 76, 569-578.

Box, G.E.P., and Jenkins, G.M. (1976). Time Series Analysis:
Forecasting and Control. Oakland, California: Holden-Day.

Brockwell, Peter J., and Davis, Richard A. (1987). Time Series:
Theory and Methods. New York: Springer-Verlag.

Fishman, George S., and Moore, Louis R. (1982). A Statistical
Evaluation of Multiplicative Congruential Random Number
Generators with Modulus $2^{31} - 1$. Journal of the American
Statistical Association, 77, 129-136.

Fishman, George S., and Moore, Louis R. (1986). An
Exhaustive Analysis of Multiplicative Congruential Random
Number Generators with Modulus $2^{31} - 1$. SIAM Journal of
Scientific and Statistical Computing, 7, 24-45.

IMSL® (1987). STAT/LIBRARY™ User's Manual. Houston:
IMSL.

Kennedy, William J., Jr., and Gentle, James E. (1980). Statistical
Computing. New York: Marcel Dekker.

Killam, Bart (1987). An Overview of the SAS® System Random
Number Generators. Proceedings of the Twelfth Annual
Conference, SAS Users Group International. Dallas, Texas,
1059-1065.

Knuth, Donald E. (1981). The Art of Computer Programming, 2nd
Edition. Volume 2: Seminumerical Algorithms. Reading,
Massachusetts: Addison-Wesley Publishing Company.

Lewis, T.G., and Payne, W.H. (1973). Generalized Feedback Shift
Register Pseudorandom Number Algorithm. Journal of the
Association for Computing Machinery, 20, 456-468.

McLeod, Ian (1975). Derivation of the Theoretical Autocovariance
Function of Autoregressive-Moving Average Time Series.
Applied Statistics, 24, 255-256.

McLeod, Ian (1977). Correction to McLeod (1975). Applied
Statistics, 26, 194.

Newton, H. Joseph, and Pagano, Marcello (1983). The Finite
Memory Prediction of Covariance Stationary Time Series.
SIAM Journal of Scientific and Statistical Computing, 4,
330-339.

Priestley, M.B. (1981). Spectral Analysis and Time Series. New
York: Academic Press.

SAS Institute Inc. (1985a). SAS/IML® User's Guide, Version 5
Edition. Cary, North Carolina: SAS Institute Inc., 73-75.

SAS Institute Inc. (1985b). SAS® Language Guide for Personal
Computers, Version 6 Edition. Cary, North Carolina: SAS
Institute Inc.

Wilson, G.T. (1978). Some Efficient Computational Procedures for
High Order ARMA Models. Journal of Statistical
Computation and Simulation, 8, 301-309.

Woodward, Wayne A., and Gray, H.L. (1981). On the
Relationship Between the S Array and the Box-Jenkins Method of
ARMA Model Identification. Journal of the American
Statistical Association, 76, 579-587.


1. True Autocovariance Function Summary Values

         Truncation    Mean Sum      Mean Absolute
Model    Lag Value     of Squares    Value
  1         55          0.16742       0.15111
  2*        10          1.01794       0.47225
  3        294         42.83101       2.38645
  4         5[...]      6.14362       0.73537
  5         29          4.86766       0.98871
  6         49          0.02877       0.04132
2. Summary Tables From Monte Carlo Experiment

Table entry is the number of the generator with
lowest mean MSE.

n\model |  1 |  2 |  3 |  4 |  5 |  6 |
--------+----+----+----+----+----+----+
     50 |  5 |  1 |  4 |  6 |  6 |  6 |
    100 |  5 |  1 |  2 |  2 |  6 |  3 |
    500 |  3 |  4 |  2 |  1 |  3 |  2 |

Table entry is the number of the generator with
lowest mean MAD.

n\model |  1 |  2 |  3 |  4 |  5 |  6 |
--------+----+----+----+----+----+----+
     50 |  5 |  1 |  4 |  6 |  6 |  5 |
    100 |  5 |  1 |  2 |  2 |  6 |  3 |
    500 |  3 |  4 |  2 |  1 |  2 |  3 |

Table entry is the number of times the given
generator had lowest mean MSE or MAD.

generator: |  1 |  2 |  3 |  4 |  5 |  6 |
-----------+----+----+----+----+----+----+
           |  3 |  4 |  3 |  3 |  2 |  3 |

3. Summary Tables From Monte Carlo Experiment

Table entry is the number of the generator with
highest mean MSE. (When 5 is the generator with
highest MSE, the number of the generator with
next highest MSE is given in parenthesis.)

n\model |  1   |  2   |  3   |  4   |  5   |  6   |
--------+------+------+------+------+------+------+
     50 |  2   | 5(3) | 5(6) | 5(4) | 5(2) |  4   |
    100 |  4   | 5(6) | 5(6) | 5(4) | 5(1) |  6   |
    500 | 5(4) | 5(2) | 5(3) | 5(4) | 5(1) | 5(1) |

Table entry is the number of the generator with
highest mean MAD. (When 5 is the generator with
highest MAD, the number of the generator with
next highest MAD is given in parenthesis.)

n\model |  1   |  2   |  3   |  4   |  5   |  6   |
--------+------+------+------+------+------+------+
     50 |  2   | 5(3) | 5(6) | 5(4) | 5(2) |  4   |
    100 |  6   | 5(6) | 5(6) | 5(4) | 5(1) |  6   |
    500 | 5(4) | 5(2) | 5(3) | 5(4) | 5(1) | 5(1) |

Table entry is the number of times the given
generator had highest mean MSE or MAD. (Value
in parenthesis is number of times if generator
5 is omitted.)

generator |  1   |  2   |  3   |  4   |  5  |  6   |
----------+------+------+------+------+-----+------+
MSE:      | 0(3) | 1(3) | 0(2) | 2(6) | 14  | 1(4) |
MAD:      | 0(3) | 1(3) | 0(2) | 1(5) | 14  | 2(5) |

4. Cases Where MSE or MAD Were not Monotone Decreasing

model  generator     model  generator     model  generator
  1        3           3      5 *           5      5 *
  1        4           3      6 (MSE)       6      2 (MAD)
  1        6           4      5 *           6      5
  2        6           4      6             6      6

* In all cases except for the starred items, the
case n=500 had smallest MSE and MAD values.

ON COMPARATIVE ACCURACY OF MULTIVARIATE
NONNORMAL  RANDOM  NUMBER  GENERATORS 


Lynne  K.  Edwards,  University  of  Minnesota 


Abstract 

There  are  two  easily  accessible  methods  of 
generating  multivariate  nonnormal 
distributions  using  the  IMSL.  They  are:  a 
multivariate  extension  of  a  power  method  with 
an  intermediate  correlation  matrix  adjustment 
and a normal-mixture method. Neither of these
methods  can  produce  all  possible 
combinations  of  marginal  skew  and  kurtosis, 
but  they  have  an  advantage  over  the  known 
extreme  distributions  when  a  multivariate 
nonnormal  distribution  with  specified 
intercorrelations  and  specified  marginal 
moments  is  desired  for  simulating  a  plausible 
nonnormal  situation.  The  MSE  and  se(MSE) 
for  the  four  marginal  moments  and  for  the 
intercorrelations  were  compared  between  the 
two  methods.  The  Fleishman-type  method 
produced  a  much  smaller  bias  in  correlation 
coefficients  than  the  normal-mixture  method 
but  the  reversed  trends  were  found  for  the 
marginal  skewness  and  kurtosis. 

Keywords: simulation algorithms; empirical
moments; MSE

1.  Introduction 

Multivariate  nonnormal  random  numbers 
are  sometimes  generated  to  simulate  a 
realistic  nonnormal  distribution  with  specified 
four  marginal  moments  and  intercorrelations. 
As  in  the  case  of  univariate  nonnormal 
distributions,  the  known  multivariate 
nonnormal  distributions  have  highly  desirable 
properties,  such  as  density  functions.  But  they 
are  often  far  from  the  plausible  nonnormal 
distributions  that  are  encountered  in  testing 
and  experiments.  A  statistic  may  be  robust  for 
all  practical  purposes  under  plausible 
nonnormality  conditions,  while  it  may  exhibit 
nonrobustness  under  extremely  nonnormal 
conditions  which  are  almost  never 
encountered  in  real  studies. 

The  extension  of  known  extreme 
univariate  distributions  to  multivariate 
distributions  is  an  obvious  option  but  it  has 
several  shortcomings.  An  extreme  distribution 
provides  implausible  skew  and  kurtosis  and  it 
is often difficult to specify the desired
intercorrelations among the variates. For
example, a log-normal distribution with $\mu = 0$
and $\sigma^2 = 1$ is often used as a right-skewed
nonnormal  distribution  but  it  has  skewness  of 
6.18  and  kurtosis  of  113.94,  a  far  departure 


from  a  typical  skewed  distribution  found  in 
psychological  and  educational  research  with 
skewness of 0.7-1.0 and kurtosis of 2.0-5.0.
Another nonnormal distribution frequently
used  in  simulations  is  a  Laplace,  but  a  "cusp" 
is  almost  never  found  in  real  data.  Still 
another  frequently  used  nonnormal 
distribution  is  a  chi-square  distribution  with 
degrees  of  freedom  ranging  from  2  to  3. 
Although  it  can  be  rescaled,  limited 
combinations  of  skewness  and  kurtosis  can  be 
generated. 

Various algorithms such as the Johnson
system, Burr system, and Schmeiser-Deutsch
system for generating flexible distributions are
well established for the univariate distributions
(Rubinstein, 1981; Burr, 1973; Tadikamalla,
1980; Schmeiser & Deutsch, 1977). Although
the  multivariate  extensions  of  such 
distributions  are  discussed  in  Johnson  (1987), 
they  tend  to  be  computationally  involved. 

An alternative approach is to extend the univariate normal-mixture
method. This method has an intuitive appeal because we can think of a
marginally skewed dataset, either in test scores or in a repeated
measures design, as data obtained from three distinct subpopulations,
each with a normal distribution with a different mean and variance but
with the same correlation matrix. One of the drawbacks of this method
is that it may generate multi-modal distributions.

Yet  another  approach  is  to  use  an 
approximate  distribution,  an  extension  of 
Fleishman's  univariate  power  method 
(Fleishman,  1978),  to  a  multivariate  situation. 
Vale  and  Maurelli  (1983)  have  shown  that  the 
multivariate  extension  works  reasonably  well 
with  their  intermediate  correlation  adjustment. 

The  last  two  methods  provide  an  intuitively 
simple  extension  from  the  univariate  to  the 
multivariate  in  simulating  testing  situations 
where  three  parallel  tests  are  given  to  the 
same  subjects,  or  in  general  when  the  same 
subjects  are  repeatedly  observed. 

The  purpose  of  this  study  is  to  compare  the 
Fleishman-type  power  method  and  a  normal- 
mixture  method,  two  relatively  easy  methods 
of  simulating  multivariate  nonnormal 
distributions  with  specified  moments  and 
intercorrelations.  It  is  of  interest  to  test  their 
relative  accuracy  on  the  marginal  moments 
and intercorrelations.

2. A Power Method

A multivariate extension of Fleishman's power method (Vale & Maurelli,
1983) is as follows:

Univariate procedure: $Y = a + bx + cx^2 + dx^3$

1. Solve the following nonlinear equations for b, c, and d, given the
desired skewness $\gamma_1$ and kurtosis $\gamma_2$ (Fleishman, 1978).

$b^2 + 6bd + 2c^2 + 15d^2 - 1 = 0$

$2c(b^2 + 24bd + 105d^2 + 2) - \gamma_1 = 0$

$24[bd + c^2(1 + b^2 + 28bd) + d^2(12 + 48bd + 141c^2 + 225d^2)] - \gamma_2 = 0$

$a = -c$ if $\mu = 0$


2. Generate a unit normal variate, x, for each variate needed.

3. Solve for the intermediate correlation matrix in x's for the
obtained coefficients a, b, c, and d. The matrix elements below and the
polynomial function for solving for the correlations in x's to be
specified are fully reported in Vale and Maurelli (1983), but the
conditional expectations can be easily applied to solve for them. The
intermediate correlations are typically larger than the specified
values to counteract the attenuation in correlations resulting from the
power transformation of x's.

With $x_i' = [1, x_i, x_i^2, x_i^3]$, $w_i' = [a_i, b_i, c_i, d_i]$, and
$\rho = \rho_{x_1 x_2}$,

$$R = E(x_1 x_2') = \begin{bmatrix} 1 & 0 & 1 & 0 \\ 0 & \rho & 0 & 3\rho \\ 1 & 0 & 2\rho^2 + 1 & 0 \\ 0 & 3\rho & 0 & 6\rho^3 + 9\rho \end{bmatrix}$$

$$r_{Y_1 Y_2} = E(Y_1 Y_2) = E(w_1' x_1 x_2' w_2) = w_1' R w_2
= \rho (b_1 b_2 + 3 b_1 d_2 + 3 d_1 b_2 + 9 d_1 d_2) + \rho^2 (2 c_1 c_2) + \rho^3 (6 d_1 d_2)$$

4. Apply a triangular decomposition (Cholesky's) to the obtained
intermediate correlation matrix R, and produce x* = x L' where LL' = R.

5. Apply the coefficients a, b, c, and d to x*:

$Y = w'x^*$, where $w' = [a, b, c, d]$ and $x^{*\prime} = [1, x, x^2, x^3]$


3. A Normal-Mixture Method

A multivariate extension of a normal-mixture method (Everitt & Hand,
1981) is as follows:

1. Generate a multivariate normal distribution:
Normal: $p(x; \mu, \Sigma)$, $x \sim N(\mu, \Sigma)$

2. For a symmetric leptokurtic distribution, solve for $\pi_1, \pi_2,
\Sigma_1, \Sigma_2$, and for a skewed distribution, solve for $\pi_1,
\mu_1, \Sigma_1$, $\pi_2, \mu_2, \Sigma_2$, and $\pi_3, \mu_3, \Sigma_3$.

Tailed: $p(x; \mu, \Sigma) = \pi_1 p(x_1; \mu, \Sigma_1) + \pi_2 p(x_2; \mu, \Sigma_2)$.

Skewed: $p(x; \mu, \Sigma) = \pi_1 p(x_1; \mu_1, \Sigma_1) + \pi_2 p(x_2; \mu_2, \Sigma_2) + \pi_3 p(x_3; \mu_3, \Sigma_3)$.

3. If a multi-modal distribution is to be avoided, a sufficient
condition has to be satisfied (Everitt & Hand, 1981):

$(\mu_2 - \mu_1)^2 < 27 \sigma_1^2 \sigma_2^2 / [4(\sigma_1^2 + \sigma_2^2)]$

Table 1. The parameters for the mixed distributions used

          marginal     marginal                components
type      μ     σ²     skewness   kurtosis     π       μ      σ²
norm      0     1      0.00       0.00         1.00    0.0    1.00
tailed    0     1      0.00       3.1212       0.80    0.0    0.49
                                               0.20    0.0    3.04
skew      0     1      1.062      2.4366       0.33   -0.4    0.25
                                               0.33   -0.2    0.25
                                               0.33    0.6    1.94
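The mixture construction of this section can be sketched in a few lines. The code below is a minimal illustration, not the authors' IMSL program. It assumes that every variate of an observation is drawn from the same component, that each component uses the common correlation matrix scaled by its own variance, and that the printed proportions 0.33 stand for 1/3 each. The function names and the use of NumPy are choices made here.

```python
import numpy as np

def mixture_sample(n, weights, means, variances, corr, rng):
    """Draw n observations from a multivariate normal mixture.

    Each observation picks one component k, then is drawn from a normal
    with mean means[k], variance variances[k] on every variate, and the
    common correlation matrix corr.
    """
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                      # assumption: 0.33 is rounded 1/3
    chol = np.linalg.cholesky(corr)
    k = rng.choice(len(w), size=n, p=w)  # component label per observation
    z = rng.standard_normal((n, corr.shape[0])) @ chol.T
    mu = np.asarray(means, dtype=float)[k][:, None]
    sd = np.sqrt(np.asarray(variances, dtype=float))[k][:, None]
    return mu + sd * z

def skewness(x):
    d = x - x.mean(axis=0)
    return (d**3).mean(axis=0) / x.std(axis=0) ** 3

corr = np.array([[1.0, 0.79, 0.53],
                 [0.79, 1.0, 0.73],
                 [0.53, 0.73, 1.0]])
rng = np.random.default_rng(1988)
x = mixture_sample(200_000, [0.33, 0.33, 0.33], [-0.4, -0.2, 0.6],
                   [0.25, 0.25, 1.94], corr, rng)
print(np.round(x.mean(axis=0), 3), np.round(x.var(axis=0), 3))
print(np.round(skewness(x), 3))
print(np.round(np.corrcoef(x.T), 3))
```

Under these assumptions the population correlation between two variates is Σπσ²·ρ + Σπμ², which for ρ = 0.79 gives about 0.829 rather than 0.79. The shared component means add covariance, so the within-component correlation would have to be lowered to hit a specified overall value.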

4.  Data  Generation 

Two plausible nonnormal distributions with specified moments (Table 1)
were generated with the IMSL library (IMSL, 1986) by a power method and
a normal-mixture method. The correlations were set to ρ12 = 0.79,
ρ13 = 0.53, and ρ23 = 0.73. Although a univariate unit normal can be
used for generating x's in the power method, the multivariate unit
normal generator, GGNSM, was used in order to reduce variances in
comparing the two methods. For the normal-mixture method, GGNSM, GGBN,
and GGMTL were used to generate the mixed distributions.

The first simulation was conducted on the Cyber 855, simulating N=2000
with 50 replications. Because a negative bias in kurtosis and huge
MSE's in kurtosis and skewness in the power method were noted, 50 new
independent simulations were conducted on the Cray 2/4 to ascertain the
accuracy of the results and to obtain the se(MSE)'s (Table 2). These 50
independent simulations represent 50 independent sets of N=2000 with 50
replications each. Fifty randomly chosen seeds were used to generate
these independent simulation sets across four distributions. The
figures reported in Table 2 represent the average of such independent
experiments. The se(MSE) is the variability of simulations across the
50 independent experiments.
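The power-method steps of Section 2 can be sketched as follows for the skewed case of Table 1 and the correlations given above. This is a minimal illustration under stated assumptions, not the IMSL-based program used in the study: the Newton solver with a finite-difference Jacobian, the bisection search for the intermediate correlation, and all function names are choices made here, and all three variates are given the same marginal moments.

```python
import numpy as np

def fleishman_residuals(p, skew, kurt):
    """Left-hand sides of Fleishman's three equations for (b, c, d)."""
    b, c, d = p
    return np.array([
        b**2 + 6*b*d + 2*c**2 + 15*d**2 - 1.0,
        2*c*(b**2 + 24*b*d + 105*d**2 + 2) - skew,
        24*(b*d + c**2*(1 + b**2 + 28*b*d)
            + d**2*(12 + 48*b*d + 141*c**2 + 225*d**2)) - kurt,
    ])

def fleishman_coefficients(skew, kurt, iters=50, h=1e-6):
    """Newton iteration started at the normal case (b, c, d) = (1, 0, 0)."""
    p = np.array([1.0, 0.0, 0.0])
    for _ in range(iters):
        jac = np.empty((3, 3))
        for j in range(3):
            e = np.zeros(3)
            e[j] = h
            jac[:, j] = (fleishman_residuals(p + e, skew, kurt)
                         - fleishman_residuals(p - e, skew, kurt)) / (2*h)
        p = p - np.linalg.solve(jac, fleishman_residuals(p, skew, kurt))
    b, c, d = p
    return -c, b, c, d            # a = -c for a zero-mean marginal

def implied_corr(rho, w1, w2):
    """Correlation of Y1, Y2 implied by intermediate correlation rho."""
    _, b1, c1, d1 = w1
    _, b2, c2, d2 = w2
    return (rho*(b1*b2 + 3*b1*d2 + 3*d1*b2 + 9*d1*d2)
            + rho**2*(2*c1*c2) + rho**3*(6*d1*d2))

def intermediate_corr(target, w1, w2):
    """Bisection on [-1, 1]; assumes implied_corr is increasing in rho."""
    lo, hi = -1.0, 1.0
    for _ in range(60):
        mid = 0.5*(lo + hi)
        if implied_corr(mid, w1, w2) < target:
            lo = mid
        else:
            hi = mid
    return 0.5*(lo + hi)

def power_method_sample(n, skew, kurt, target_corr, rng):
    w = fleishman_coefficients(skew, kurt)
    p = target_corr.shape[0]
    inter = np.eye(p)
    for i in range(p):
        for j in range(i + 1, p):
            inter[i, j] = inter[j, i] = intermediate_corr(target_corr[i, j], w, w)
    z = rng.standard_normal((n, p)) @ np.linalg.cholesky(inter).T
    a, b, c, d = w
    return a + b*z + c*z**2 + d*z**3

target = np.array([[1.0, 0.79, 0.53],
                   [0.79, 1.0, 0.73],
                   [0.53, 0.73, 1.0]])
coef = fleishman_coefficients(1.062, 2.4366)
y = power_method_sample(200_000, 1.062, 2.4366, target, np.random.default_rng(0))
print(np.round(coef, 4))
print(np.round(np.corrcoef(y.T), 3))
```

The kurtosis argument is the excess kurtosis, matching Table 1, where the normal case is listed as 0.00.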

5.  Simulation  Results 

Both  Fleishman's  and  mixture  methods  are 
limited  in  the  type  of  nonnormal  distributions 
they  can  generate.  However,  they  are  easy  to 
use  with  the  help  of  the  IMSL  and  are  of 
reasonable  accuracy  for  researchers  needing 
plausible  nonnormal  distributions  with 
specified  marginal  moments  up  to  the  fourth, 
and  with  specified  intercorrelations. 

Within  the  limits  of  the  distributions  tested: 

1. Fleishman's method is superior to a mixture method in generating
data with intercorrelations close to the population values and with
smaller MSE's. In particular, a mixture method produced highly
positively biased intercorrelations when each marginal was skewed in
the same direction.

2. A normal-mixture method (a three-distribution mix for a skewed
distribution and a two-distribution mix for a tailed distribution) was
superior to Fleishman's method in generating data closer to the
specified skewness and kurtosis and with smaller MSE's.

It is understandable that a power method produced a set of
intercorrelations close to the specified values because of the
intermediate correlation adjustment. If a robustness study involves a
statistic which is highly dependent on the sample intercorrelations, a
power method may be more desirable than a normal-mixture method. On the
other hand, if specified skewness and kurtosis accompanied by small MSE
are required in a simulation, a normal-mixture method, which has finite
higher moments for each component normal, produces smaller biases and
variances and is more stable across independent simulations, as
indicated by small se(MSE)'s.

[Figures 1-4: Mean, Variance, Skewness, and Kurtosis, each plotted for
Skew-F, Skew-M, Tail-F, and Tail-M. A further figure: Bias for mean,
variance, skewness, kurtosis, corr12, corr13, and corr23.]


Table 2. Mean, MSE and se(MSE): average of 50 experiments for N=2000
with 50 replications

Statistic  Distribution  Parameter   Estimate    MSE         se(MSE)
Mean       Mix Tail      0.0         0.000630    0.000484    0.000015
           F Tail        0.0         0.000432    0.000471    0.000016
           Mix Skew      0.0         0.000316    0.000495    0.000016
           F Skew        0.0         0.000450    0.000474    0.000016
Var        Mix Tail      1.0         0.997878    0.002532    0.000062
           F Tail        1.0         0.999318    0.002714    0.000061
           Mix Skew      1.0         0.999242    0.002271    0.000060
           F Skew        1.0         0.998956    0.002218    0.000053
Skew       Mix Tail      0.0         0.003381    0.028467    0.000664
           F Tail        0.0         0.000689    0.039084    0.001193
           Mix Skew      1.062       1.059118    0.009048    0.000261
           F Skew        1.062       1.056038    1.131984    0.005053
Kurtosis   Mix Tail      3.1212      3.091182    0.278781    0.008050
           F Tail        3.1212      3.032055   10.744024    0.315243
           Mix Skew      2.4366667   2.421257    0.119259    0.004014
           F Skew        2.4366667   2.391851    6.633025    0.148044
corr12     Mix Tail      0.79        0.789920    0.000152    0.000004
           F Tail        0.79        0.790241    0.000092    0.000002
           Mix Skew      0.79        0.829288    0.001638    0.000015
           F Skew        0.79        0.790128    0.000097    0.000003
corr13     Mix Tail      0.53        0.529925    0.000532    0.000016
           F Tail        0.53        0.530370    0.000289    0.000009
           Mix Skew      0.53        0.617893    0.008074    0.000058
           F Skew        0.53        0.530445    0.000322    0.000010
corr23     Mix Tail      0.73        0.730281    0.000223    0.000006
           F Tail        0.73        0.730270    0.000130    0.000004
           Mix Skew      0.73        0.780562    0.002697    0.000025
           F Skew        0.73        0.730394    0.000141    0.000004

Note: The Cray 2/4 was used for this run. The mean, variance, skewness
and kurtosis for the first variate are reported. The average of 50
independent simulations, each with N=2000 repeated 50 times, is
reported.


ROBUSTNESS STUDY OF SOME RANDOM VARIATE GENERATORS


Lih-Yuan  Deng,  Memphis  State  University 


Abstract 

Empirical studies using computer-generated random numbers have been
widely used where the mathematics of analyzing a statistical procedure
becomes intractable.

There are several generating methods to produce a random sequence with
a given distribution. Most, if not all, of the methods are based on the
generation of independent variates from a uniform random distribution.
Comparison of the different generating methods is usually done under
the criterion of "efficiency". With the wide availability of a wide
variety of computers, the cost of computing is falling dramatically.
Computational efficiency should not be the only criterion in choosing
among different random number generators. We propose a new criterion,
"robustness", to compare the performance of different generating
schemes.

There are two basic techniques for generating variates from U(0,1): the
congruential methods and feedback shift register methods. Neither is
known to generate a "true" random sequence. In this paper, using beta
random variate generating methods as an example, we compare the
"robustness" of several generators. It is shown that some methods
perform poorly in the sense that the generated variates differ
considerably from the specified distribution when the uniform generator
fails "slightly".

1.  Introduction 

The beta family of distributions, with the p.d.f. given as

$$f_X(x) = \begin{cases} \dfrac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}\, x^{a-1} (1-x)^{b-1}, & 0 < x < 1, \\ 0, & \text{elsewhere,} \end{cases}$$

has been used to model random processes with a finite range because its
parameter values allow many shapes of the p.d.f.

There are several popular generating methods for arbitrary values of
the parameters. Based on the rejection method, Jöhnk (1961) first
developed a generating method for arbitrary a, b. Ahrens and Dieter
(1974) proposed a more efficient generating method when both a and b
are greater than one. Other methods are more efficient for some special
types of parameter values; see Atkinson and Whittaker (1976). Cheng
(1978) compared these and other methods for generating beta variates
based on the criterion of efficiency.

Note that all the methods are based on the successful generation of
independent variates from a uniform random distribution; i.e., the
theoretical distribution of the variates generated by each method will
follow the beta distribution with the given parameters a, b if one can
generate truly uniform numbers. Two basic techniques are widely used:
the congruential methods and feedback shift register methods. Neither
is known to generate a "true" random sequence. In fact, if a uniform
random number generator has passed a sequence of statistical tests
$T_1, T_2, \ldots, T_n$, there is no guarantee that it will pass a
further test $T_{n+1}$. In practice, a uniform random number generator
is considered "random" if several statistical tests have been passed.
Another problem of a uniform random number generator is that it may
sometimes display locally non-random behavior, i.e., a block of numbers
tends toward some bias whereas the next block tends toward the opposite
bias. For further discussion, see Knuth (1969) and Kennedy and Gentle
(1980).


In this paper we are concerned with the quality of a given random
number generator in the situation where we fail to generate truly
uniform random numbers. Note that if truly uniform random numbers can
be generated, then all proposed methods will yield the desired
distribution, and the only criterion for comparing generating methods
is usually "efficiency". As we shall see in Section 3, the resulting
distributions may be quite different under the "alternative"
distributions. In the next section, we will consider two simple
generating methods for a special beta distribution, with a = n, b = 1.

2.  Comparisons  of  Generating  Methods 

It is easy to see that the following will generate a random variate
with distribution beta(n,1):

$X = \max(U_1, U_2, \ldots, U_n)$   (A)

and

$Y = U^{1/n}$   (B)

where $U_1, U_2, \ldots, U_n$ are i.i.d. $\sim U(0,1)$ and $U \sim U(0,1)$.
One  can  easily  see  that 

(1)  When  n  is  small,  method  (A)  will  be  more  efficient 
than  (B)  because  n-th  root  computation  will  take 
a  longer  computing  time  than  sorting  a  small  array 
of  numbers. 

(2)  When  n  is  large,  method  (A)  will  be  less  efficient 
than  (B)  because  n-th  root  computation  will  take 
a  shorter  computing  time  than  sorting  a  large  ar¬ 
ray  of  numbers.  The  computing  time  of  method 
(A)  is  known  to  be  proportional  to  nlogn,  where 
method  (B)  computing  time  is  independent  of  the 
size  n. 

(3)  Another  advantage  of  method  (B)  over  method 
(A)  is  that  it  can  still  be  used  even  when  n  is  non- 
integer,  where  method  (A)  will  fail  for  non-integer 
n. 
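Methods (A) and (B) can be written in a few lines each. The sketch below uses NumPy, with function names chosen here for illustration. Both constructions have c.d.f. t^n on (0,1), so both sample means should be close to n/(n+1).

```python
import numpy as np

def method_a(n, size, rng):
    """Method (A): maximum of n independent uniforms (integer n only)."""
    return rng.random((size, n)).max(axis=1)

def method_b(n, size, rng):
    """Method (B): n-th root of a single uniform (n may be non-integer)."""
    return rng.random(size) ** (1.0 / n)

rng = np.random.default_rng(20)
n = 5
x = method_a(n, 100_000, rng)
y = method_b(n, 100_000, rng)
print(round(x.mean(), 3), round(y.mean(), 3), round(n / (n + 1), 3))
```

Method (A) consumes n uniforms per variate while method (B) consumes one, which is the source of the efficiency trade-off discussed in points (1) and (2).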

In the next section, we will consider the question of "robustness" of
these two generators. That is, if the $U_i$'s do not follow a uniform
distribution, which generator will produce random variates with a
distribution closer to the beta(n,1) distribution?

3.  Robustness  of  Generating  Methods 

Note that X and Y will follow a beta(n,1) distribution if
$U_1, U_2, \ldots, U_n$ and $U$ in (A) and (B) follow a uniform
distribution over [0,1]. We will assume that

$U_1, U_2, \ldots, U_n$ i.i.d. $\sim F_\theta(t)$

and

$U \sim F_\theta(t)$

where $F_\theta(t)$ is the c.d.f. of a distribution over [0,1] which is
close to but not exactly the uniform distribution, and $\theta$ can be
considered as the parameter of the family of "neighborhood"
distributions around the uniform distribution. Without loss of
generality, we assume that when $\theta = 0$, $F_0(t)$ is the c.d.f. of
the uniform distribution. One can easily derive the cumulative
distribution functions (c.d.f.) as follows:

$F_X(t) = \Pr(\max_i U_i \le t) = \prod_{i=1}^{n} \Pr(U_i \le t) = [F_\theta(t)]^n$,   (1)

$F_Y(t) = \Pr(U^{1/n} \le t) = F_\theta(t^n)$.   (2)

The c.d.f. of $Z \sim$ beta(n,1) is given as

$F_Z(t) = t^n$.   (3)

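To study the relationship among $F_X(t)$, $F_Y(t)$ and $F_Z(t)$, we will
make some assumptions about $F_\theta(t)$. Let $f_\theta(t)$ be the
p.d.f. of $F_\theta(t)$ and, for 0 < t < 1,

$f_\theta(t) = 1 + g_\theta(t)$.   (4)

Note that $g_\theta(t)$ represents the "deviation" of $f_\theta(t)$ from
the p.d.f. of the uniform distribution. We will assume that
$g_\theta(t)$ is a bounded and continuous function both in t and in
$\theta$. Denote the maximum positive and negative deviations of
$f_\theta(t)$ as

$\epsilon_\theta^+ = \max_{0 \le t \le 1} g_\theta(t)$   (5)

and

$\epsilon_\theta^- = -\min_{0 \le t \le 1} g_\theta(t)$.   (6)

The c.d.f. can be written as

$F_\theta(t) = t + G_\theta(t)$,  0 < t < 1,   (7)

where

$G_\theta(t) = \int_0^t g_\theta(u)\,du$.   (8)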

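It is easy to show that

LEMMA 1. $\epsilon_\theta^+ \ge 0$, $\epsilon_\theta^- \ge 0$ and
$(1 - \epsilon_\theta^-) \le f_\theta(t) \le (1 + \epsilon_\theta^+)$.

PROOF: Using (7), we can see that $G_\theta(0) = G_\theta(1) = 0$.
From the Mean Value Theorem, we have

$0 = G_\theta(1) - G_\theta(0) = (1 - 0)\, g_\theta(\lambda)$ for some $\lambda \in [0,1]$.

Therefore, we have shown that

$-\epsilon_\theta^- = \min_{0 \le t \le 1} g_\theta(t) \le 0 = g_\theta(\lambda) \le \max_{0 \le t \le 1} g_\theta(t) = \epsilon_\theta^+$.

Lemma 1 follows from (4)-(6). ∎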

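The relationship among $F_X(t)$, $F_Y(t)$ and $F_Z(t)$ is summarized in
the following theorem:

THEOREM 1. For 0 < t < 1,

$(1 - \epsilon_\theta^-)^n \le \dfrac{F_X(t)}{F_Z(t)} \le (1 + \epsilon_\theta^+)^n$   (9)

and

$(1 - \epsilon_\theta^-) \le \dfrac{F_Y(t)}{F_Z(t)} \le (1 + \epsilon_\theta^+)$.   (10)

PROOF: From (1)-(3), we have, for 0 < t < 1,

$\dfrac{F_X(t)}{F_Z(t)} = \left[\dfrac{F_\theta(t)}{t}\right]^n$  and  $\dfrac{F_Y(t)}{F_Z(t)} = \dfrac{F_\theta(t^n)}{t^n}$.

From (7), we have, for 0 < t < 1,

$\dfrac{F_\theta(t)}{t} = 1 + \dfrac{G_\theta(t)}{t}$.   (11)

Applying the Mean Value Theorem, we have

$G_\theta(t) = G_\theta(t) - G_\theta(0) = (t - 0)\, g_\theta(\lambda)$, for some $\lambda \in [0,t]$.   (12)

Plugging (12) into (11), we get

$\dfrac{F_\theta(t)}{t} = 1 + g_\theta(\lambda)$, for some $\lambda \in [0,t]$,

and therefore

$(1 - \epsilon_\theta^-) \le \dfrac{F_\theta(t)}{t} \le (1 + \epsilon_\theta^+)$.   (13)

The inequalities in (9) and (10) follow easily from (13), applied at t
and at $t^n$ respectively. ∎

Theorem 1 shows that $F_Y(t)/F_Z(t)$ has a tighter bound than
$F_X(t)/F_Z(t)$. Therefore $F_Y(t)$ can be much closer to $F_Z(t)$ than
$F_X(t)$ is to $F_Z(t)$, and the difference will be more dramatic when n
is large. We will show in our next theorem that the same conclusion
holds true when comparing their corresponding p.d.f.'s.

Let $f_X(t)$, $f_Y(t)$ and $f_Z(t)$ be the p.d.f.'s of $F_X(t)$,
$F_Y(t)$ and $F_Z(t)$, respectively. Taking the derivatives of their
c.d.f.'s in (1)-(3), they can be written as follows (for 0 < t < 1):

$f_X(t) = n\,[F_\theta(t)]^{n-1} f_\theta(t)$,   (14)

$f_Y(t) = n\,t^{n-1} f_\theta(t^n)$   (15)

and

$f_Z(t) = n\,t^{n-1}$.   (16)

The relationship among $f_X(t)$, $f_Y(t)$ and $f_Z(t)$ is summarized in
the following theorem:

THEOREM 2. For 0 < t < 1, we have

$(1 - \epsilon_\theta^-)^n \le \dfrac{f_X(t)}{f_Z(t)} \le (1 + \epsilon_\theta^+)^n$   (17)

and

$(1 - \epsilon_\theta^-) \le \dfrac{f_Y(t)}{f_Z(t)} \le (1 + \epsilon_\theta^+)$.   (18)

PROOF: From (14), (15) and (16), we can see that

$\dfrac{f_X(t)}{f_Z(t)} = \left[\dfrac{F_\theta(t)}{t}\right]^{n-1} f_\theta(t)$   (19)

and

$\dfrac{f_Y(t)}{f_Z(t)} = f_\theta(t^n)$.   (20)

From Lemma 1, we know that

$(1 - \epsilon_\theta^-) \le f_\theta(t^n) \le (1 + \epsilon_\theta^+)$.   (21)

Theorem 2 follows easily from (13) and (19)-(21). ∎

Theorem 2 again shows that $f_Y(t)/f_Z(t)$ has a tighter bound than
$f_X(t)/f_Z(t)$. Therefore $f_Y(t)$ can be much closer to $f_Z(t)$ than
$f_X(t)$ is to $f_Z(t)$ when the true distribution of the "uniform
generator" is not really uniform, especially when n is large. Theorems
1 and 2 show that method (B) is more "robust" than method (A). One
intuitive argument for why (A) is not as robust as (B) is that X will
follow beta(n,1) only when each of the $U_i$ is i.i.d. U(0,1), whereas
Y ~ beta(n,1) whenever U ~ U(0,1).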
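The bounds of Theorem 1 can be checked numerically for one concrete "neighborhood" family. The sketch below assumes a hypothetical deviation $g_\theta(t) = \theta(2t - 1)$, so that $F_\theta(t) = t + \theta(t^2 - t)$ and both maximum deviations equal |θ|. This family and the function names are illustrations chosen here, not taken from the paper.

```python
import numpy as np

def F_theta(t, theta):
    """Hypothetical near-uniform c.d.f. with density 1 + theta*(2t - 1)."""
    return t + theta * (t**2 - t)

def cdf_ratios(t, n, theta):
    """Return F_X/F_Z and F_Y/F_Z, using the c.d.f.'s (1), (2) and (3)."""
    ratio_x = F_theta(t, theta) ** n / t**n
    ratio_y = F_theta(t**n, theta) / t**n
    return ratio_x, ratio_y

theta, n = 0.05, 20
t = np.linspace(0.01, 0.99, 99)
ratio_x, ratio_y = cdf_ratios(t, n, theta)
dev_x = np.abs(ratio_x - 1.0).max()   # worst relative error of method (A)
dev_y = np.abs(ratio_y - 1.0).max()   # worst relative error of method (B)
print(round(dev_x, 3), round(dev_y, 3))
```

With a 5% distortion of the uniform density and n = 20, the relative error of method (B) stays within 5%, while that of method (A) exceeds 60% near t = 0, in line with the n-th power in the bound for $F_X(t)/F_Z(t)$.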

An extensive empirical study comparing the robustness of beta random
number generators, as well as other random number generators, is under
investigation and will be reported elsewhere.

4.  Summary 

We have shown that if the true distribution of the so-called "uniform
random generator" is slightly different from U(0,1), then various
generators may yield distributions quite different from the one we try
to generate. With the wide availability of cheaper and faster
computers, one should not be concerned mainly with the cost of
computing time. That is, efficiency should no longer be the only
criterion for comparing the performance of the generators. We propose
in this paper to adopt a new criterion, "robustness", to compare the
performance of different generating schemes.


A RATIO-OF-UNIFORMS METHOD FOR GENERATING EXPONENTIAL POWER VARIATES


Dean M. Young, Baylor University
Danny W. Turner, Baylor University
John W. Seaman, Jr., University of Southwestern Louisiana


1 .  Introduction 

The standardized exponential power distribution (EPD) family has
probability density function

$g_\alpha(x) = \dfrac{1}{2\Gamma(1 + 1/\alpha)} \exp(-|x|^\alpha)$,  $-\infty < x < \infty$,  $\alpha \ge 1$.

This family is symmetric about zero and contains members with a variety
of tail shapes from the uniform ($\alpha \to \infty$) to the normal
($\alpha = 2$) to the double exponential ($\alpha = 1$). Because of the
diversity of available tail shapes, the EPD family has proven useful in
robustness studies.

For a review of such applications and others, see Box and Tiao (1973)
and Tadikamalla (1980).

Johnson (1979) and Johnson, Tietjen, and Beckman (1980) have provided
direct transformation methods for EPD random variate generation.
Tadikamalla (1980) has derived generalized rejection techniques for EPD
random variate generation. He has provided two algorithms, called ED
for $1 \le \alpha < 2$ and EN for $2 \le \alpha \le 6$. (Values of
$\alpha$ greater than 6 are of little interest because of their extreme
kurtosis values.) Tadikamalla has found the combination of ED and EN,
hereafter referred to as ED/EN, to be superior to the gamma
transformation methods for $1 \le \alpha \le 6$. For convenience we
present ED and EN below. See Tadikamalla (1980) for more discussion of
these algorithms. We denote the uniform distribution over a set S by
U(S). We denote the normal distribution with mean m and variance v by
N(m,v).

Algorithm ED: (for $1 \le \alpha < 2$)

Step 0. Compute $A = 1/\alpha$, $B = A^A$
(required once for each $\alpha$).

Step 1. Generate a double-exponential variate X:

a) Generate U from U(0,1).

b) If U > .5, then X = B(-ln(2(1-U))).
Otherwise, X = B ln(2U).

Step 2. Generate R from U(0,1).

Step 3. Test of acceptance/rejection: If
$\ln(R) > (-|X|^\alpha + |X|/B - 1 + A)$, then go to Step 1.
Otherwise, return X.


Algorithm EN: (for $\alpha \ge 2$)

Step 0. Compute $A = 1/\alpha$, $B = A^A$
(required once for each $\alpha$).

Step 1. Generate X from N(0, $B^2$).

Step 2. Generate R from U(0,1).

Step 3. Test of acceptance/rejection: If
$\ln(R) > (-|X|^\alpha + X^2/(2B^2) + A - .5)$,
then go to Step 1. Otherwise, return X.
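Algorithms ED and EN can be transcribed almost step for step. The sketch below uses Python's standard `random` module. The function names, the guards that keep the logarithm arguments strictly positive, and the dispatch helper `epd_ed_en` are additions made here for illustration.

```python
import math
import random

def epd_ed(alpha, rng):
    """Algorithm ED (1 <= alpha < 2): double-exponential envelope."""
    A = 1.0 / alpha
    B = A**A                                   # Step 0
    while True:
        u = rng.random()                       # Step 1a
        if u == 0.0:
            continue                           # guard: avoid ln(0)
        if u > 0.5:                            # Step 1b
            x = B * (-math.log(2.0 * (1.0 - u)))
        else:
            x = B * math.log(2.0 * u)
        r = 1.0 - rng.random()                 # Step 2, r in (0, 1]
        if math.log(r) <= -abs(x)**alpha + abs(x) / B - 1.0 + A:
            return x                           # Step 3: accept

def epd_en(alpha, rng):
    """Algorithm EN (alpha >= 2): normal envelope with variance B**2."""
    A = 1.0 / alpha
    B = A**A                                   # Step 0
    while True:
        x = rng.gauss(0.0, B)                  # Step 1
        r = 1.0 - rng.random()                 # Step 2, r in (0, 1]
        if math.log(r) <= -abs(x)**alpha + x * x / (2.0 * B * B) + A - 0.5:
            return x                           # Step 3: accept

def epd_ed_en(alpha, rng):
    """Pick ED or EN according to alpha, as the combined ED/EN method does."""
    return epd_ed(alpha, rng) if alpha < 2.0 else epd_en(alpha, rng)

def sample_variance(xs):
    m = sum(xs) / len(xs)
    return sum((v - m) ** 2 for v in xs) / (len(xs) - 1)

rng = random.Random(1980)
for alpha in (1.5, 3.0):
    xs = [epd_ed_en(alpha, rng) for _ in range(40_000)]
    exact = math.gamma(3.0 / alpha) / math.gamma(1.0 / alpha)
    print(alpha, round(sample_variance(xs), 3), round(exact, 3))
```

The comparison value Γ(3/α)/Γ(1/α) is the variance of the standardized EPD density, a standard result used here only as a sanity check.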

In this paper we develop a simpler, ratio-of-uniforms method of EPD
random variate generation and compare it to Tadikamalla's ED and EN
algorithms for $1 \le \alpha \le 6$. It is found that generation times
for the ratio-of-uniforms method are uniformly better than ED and EN in
this range.

2. The Ratio-of-Uniforms Method for EPD Variate Generation

In this section we shall briefly review the ratio-of-uniforms (ROU)
method and apply it to EPD variate generation. For a more thorough
review of the ROU method, see, for example, Devroye (1986).

First proposed by Kinderman and Monahan (1977), the ROU method has been
studied by several authors, including Kinderman and Monahan (1979),
Cheng and Feast (1979), Robertson and Walls (1980), and Barbu (1983).
The method is based on the following result, due to Kinderman and
Monahan (1977).

Theorem 2.1 Suppose f is a nonnegative integrable function on the real
numbers. Let the random vector (U,V) have a uniform distribution over
the set

$D = \{(u,v): 0 \le u \le \sqrt{f(v/u)}\}$.

Then, V/U has density f/c, where c = 2[area(D)].

The basic idea is to enclose D in some simple set E, generate
observations from a uniform distribution on E, and apply the rejection
principle. The following result is proved in Devroye (1986, p. 195).

Theorem 2.2 Let f and D be defined as above. Let $b$, $a_-$, and $a_+$
be constants such that

$b \ge \sup_x \sqrt{f(x)}$,
$a_- \le \inf_x x\sqrt{f(x)}$, and
$a_+ \ge \sup_x x\sqrt{f(x)}$.

Let E be the rectangle formed by the Cartesian product
$[0,b] \times [a_-, a_+]$. Then, the set D can be enclosed in E if and
only if $f(x)$ and $x^2 f(x)$ are bounded for all x.

With these theoretical results in place, we now present the general ROU
algorithm. Again, we denote the uniform distribution over a set S by
U(S).

Algorithm ROU:

Step 0. Compute $b$, $a_-$, and $a_+$ (required once for each $\alpha$).

Step 1. Generate U from U(0,b).

Step 2. Generate V from U[$a_-$, $a_+$].

Step 3. Set X = V/U.

Step 4. If $U^2 \le f(X)$, then return X. Otherwise, go to Step 1.

For the EPD application, we take $f(x) = \exp(-|x|^\alpha)$. It is easy
to show that $b = 1$, $a_+ = (2/(e\alpha))^{1/\alpha}$, and
$a_- = -(2/(e\alpha))^{1/\alpha}$ satisfy the conditions of Theorem 2.2.
Note that only one calculation is required in Step 0 since
$a_- = -a_+$ and b is constant for all $\alpha$.
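Algorithm ROU for the EPD family can be sketched as below, again with Python's standard `random` module. The function name, the guard against U = 0, and the handling of floating-point overflow are additions made here. An overflowing $|X|^\alpha$ means f(X) underflows to zero, so such a point is rejected.

```python
import math
import random

def epd_rou(alpha, rng):
    """Ratio-of-uniforms sampler for density proportional to exp(-|x|**alpha)."""
    a_plus = (2.0 / (math.e * alpha)) ** (1.0 / alpha)   # Step 0 (b = 1)
    while True:
        u = rng.random()                                 # Step 1: U(0, 1)
        if u == 0.0:
            continue                                     # guard: avoid V/0
        v = rng.uniform(-a_plus, a_plus)                 # Step 2
        x = v / u                                        # Step 3
        try:
            fx = math.exp(-abs(x) ** alpha)
        except OverflowError:
            continue                                     # f(x) is effectively 0
        if u * u <= fx:                                  # Step 4: accept
            return x

def sample_variance(xs):
    m = sum(xs) / len(xs)
    return sum((v - m) ** 2 for v in xs) / (len(xs) - 1)

rng = random.Random(1977)
xs = [epd_rou(2.0, rng) for _ in range(40_000)]
print(round(sample_variance(xs), 3))   # alpha = 2 is a normal with variance 1/2
```

Each pass through the loop needs two uniforms, one division, one power and one exponential, which is the simplicity argument made for ROU in the comparison that follows.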


3. Comparison of the Algorithms

We begin by considering ease of implementation. Set-up times for ROU
and ED/EN are comparable, requiring the generation of one constant
involving a single exponentiation (Step 0 in each algorithm). However,
in simulations involving more than one $\alpha$ value, one must decide
whether to use ED or EN according to the value of $\alpha$, so that
set-up for ED/EN is slightly more complicated.

Algorithm ROU requires the generation of two uniform random variates
(Steps 1 and 2) which must be combined in a ratio (Step 3). One
comparison is made (Step 4) involving an evaluation of the function
f(x). In contrast, an application of ED requires the generation of two
uniform random variates (Steps 1 and 2), both of which must be
evaluated in a logarithmic function (Steps 1 and 3). Furthermore, two
comparisons are required (Steps 1 and 3), one of which (Step 3)
requires the evaluation of a function more complicated than f(x).
Algorithm EN is similarly complex but involves the calculation of a
uniform and a normal deviate rather than of two uniforms. Clearly, ROU
is much simpler than ED/EN. Devroye (1986, p. 8) has noted that among
factors that play an important role in the choice of random variate
generators "[simplicity and readability are] perhaps the most neglected
in the literature." Algorithm ROU should certainly be selected over
ED/EN on the basis of these criteria.

We now consider efficiency of the algorithms. Devroye (1986, p. 196)
has derived the expected number of iterations per variate produced, the
so-called rejection constant, for the general ROU algorithm. Its
general form is given below:

$r_{ROU} = \dfrac{b(a_+ - a_-)}{\text{area}(D)} = \dfrac{2b(a_+ - a_-)}{c}$.

Written as a function of $\alpha$, the rejection constant for the EPD
family is given by

$r_{ROU}(\alpha) = \dfrac{2\,(2/(\alpha e))^{1/\alpha}}{\Gamma(1 + 1/\alpha)}$,  $0 < \alpha < \infty$.

Tadikamalla (1980) has provided rejection constants for ED and EN. As
functions of $\alpha$ they may be written as the following:

$r_{ED}(\alpha) = \dfrac{(1/\alpha)^{1/\alpha}\, e^{(1 - 1/\alpha)}}{\Gamma(1 + 1/\alpha)}$,  $1 \le \alpha < 2$,

and

$r_{EN}(\alpha) = \dfrac{\sqrt{\pi/2}\,(1/\alpha)^{1/\alpha}\, e^{(.5 - 1/\alpha)}}{\Gamma(1 + 1/\alpha)}$,  $\alpha \ge 2$.

[FIGURE I. Trimmed-Mean Time To Generate 10,000 Observations from
Exponential Power Distributions for Varied Powers]

We shall use these rejection constants as measures of efficiency, where
efficiency is defined as the reciprocal of the rejection constant. A
graph of the efficiency functions $1/r_{ROU}$, $1/r_{ED}$, and
$1/r_{EN}$ is shown in Figure 1 for various values of $\alpha$. For the
values of $\alpha$ that are of interest, ED/EN is uniformly more
efficient than ROU. However, efficiency is a function of expected
number of iterations per variate. It will be seen that the greater
complexity of algorithm ED/EN requires longer time per iteration
relative to algorithm ROU, thus negating the value of higher
efficiency.
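The three rejection constants are easy to tabulate. The sketch below evaluates the formulas of this section with Python's `math` module; the function names and the chosen grid of α values are illustrative.

```python
import math

def r_rou(alpha):
    """Rejection constant of algorithm ROU for the EPD family."""
    return 2.0 * (2.0 / (alpha * math.e)) ** (1.0 / alpha) / math.gamma(1.0 + 1.0 / alpha)

def r_ed(alpha):
    """Rejection constant of algorithm ED (1 <= alpha < 2)."""
    return ((1.0 / alpha) ** (1.0 / alpha) * math.exp(1.0 - 1.0 / alpha)
            / math.gamma(1.0 + 1.0 / alpha))

def r_en(alpha):
    """Rejection constant of algorithm EN (alpha >= 2)."""
    return (math.sqrt(math.pi / 2.0) * (1.0 / alpha) ** (1.0 / alpha)
            * math.exp(0.5 - 1.0 / alpha) / math.gamma(1.0 + 1.0 / alpha))

def r_ed_en(alpha):
    """Rejection constant of the combined ED/EN method."""
    return r_ed(alpha) if alpha < 2.0 else r_en(alpha)

for alpha in (1.0, 1.5, 2.0, 3.0, 4.0, 6.0):
    print(alpha, round(1.0 / r_rou(alpha), 3), round(1.0 / r_ed_en(alpha), 3))
```

At α = 1 and α = 2 the ED and EN envelopes coincide with the target density, so their rejection constants equal 1 there, while ROU needs about 1.47 and 1.37 iterations per variate at the same points.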

To compare generation speeds, we have coded the algorithms in SAS PROC
MATRIX on an IBM 4381 Model P22. Algorithm EN requires the generation
of normal random variates. The SAS function RANNOR has been utilized
for this purpose. The PROC MATRIX implementation of each algorithm has
been used to generate 50,000 variates in five independent runs of
10,000 variates each for $1 \le \alpha \le 6$. Extreme generation-time
values have been trimmed from each set of five runs. Figure I gives the
trimmed average generation times in seconds for each algorithm. As can
be seen from Figure I, algorithm ROU exhibits uniformly faster
performance than algorithm ED/EN.

4.  Conclusions 

We have presented a ratio-of-uniforms algorithm, called ROU, for
exponential power random variate generation and have compared it to a
generalized rejection method, called ED/EN, developed by Tadikamalla
(1980). We have demonstrated that while ROU is inferior to ED/EN with
respect to efficiency (iterations required per random variate), it is
markedly superior in generation time, which is, practically, the most
important measure of performance. Furthermore, a direct comparison of
algorithms ROU and ED/EN clearly indicates that ROU is far simpler and
more easily implemented. Devroye (1986, p. 11) notes that, "It is a
general rule in computer science that speed can be reduced by using
longer, more sophisticated programs." Happily, our comparison of ROU
with ED/EN seems to provide an exception to that rule.

5.  Acknowledgement 

The  authors  wish  to  thank  Professor  Luc  Devroye 
for  his  helpful  comments  and  encouragement. 

References 

Barbu, G. (1983). On computer generation of random variables as a ratio
of uniform random variables. Economic Computation and Economic
Cybernetics Studies and Research, Academy of Economic Studies,
Bucharest, Vol. 18, 33-50.

Box, G.E.P. and Muller, M.E. (1958). A note on the generation of random
normal deviates. Annals of Mathematical Statistics, 29, 610-611.

Box, G.E.P. and Tiao, G.C. (1973). Bayesian Inference in Statistical
Analysis. Reading, Mass.: Addison-Wesley, 149-202.

Cheng, R.C.H. and Feast, G.M. (1979). Some simple gamma variate
generators. Applied Statistics, 28, 290-295.

Devroye, L. (1986). Non-Uniform Random Variate Generation. New York:
Springer-Verlag.

Johnson, M.E. (1979). Computer generation of the exponential power
distributions. Journal of Statistical Computation and Simulation, 9,
239-240.

Johnson, M.E., Tietjen, G.L., and Beckman, R.J. (1980). A new family of
probability distributions with applications to Monte Carlo studies.
Journal of the American Statistical Association, 75, 276-279.

Kinderman, A.J. and Monahan, J.F. (1977). Computer generation of random
variables using the ratio of uniform deviates. ACM Transactions on
Mathematical Software, 3, 257-260.

Kinderman, A.J. and Monahan, J.F. (1979). New methods for generating
Student's t and gamma variables. Technical Report, Department of
Management Science, California State University, Northridge, CA.

Kinderman, A.J. and Ramage, J.G. (1976). Computer generation of normal
random variables. Journal of the American Statistical Association, 71,
893-896.

Robertson, I. and Walls, L.A. (1980). Random number generators for the
normal and gamma distributions using the ratio of uniforms method.
Technical Report AERE-R 10032, U.K. Atomic Energy Authority, Harwell,
Oxfordshire.

Tadikamalla, P.R. (1980). Random sampling from the exponential power
distribution. Journal of the American Statistical Association, 75,
683-686.




AN  APPROACH  FOR  GENERATION  OF  TWO  VARIABLE  SETS  WITH  A  SPECIFIED 
CORRELATION  AND  FIRST  AND  SECOND  SAMPLE  MOMENTS 


Mark  Eakin,  Ph.D.  and  Henry  D.  Crockett,  C.S.P. 


ABSTRACT 

Certain simulations require the generation
of correlated variables with prespecified
first and second moments. The first step
involved the random generation of two
standardized variables. Secondly, the first
variable was replaced by a linear combination
of the two variables such that the
coefficient of the linear combination
produced the specified correlation between
the combination and the second variable.
The variables can then be adjusted to give
the required first and second sample moments
without modifying the correlation equations.

INTRODUCTION 


This paper presents a way of generating two
real-valued variables that have a fixed
sample correlation. Edwards (1959) and
Searle and Firey (1980) discuss procedures
to generate two integer-valued variables
that have a specified correlation. Both
procedures require several iterations in
order to achieve the desired correlation.
However, in large-scale simulations the
iterative approach is not efficient.
Kvalseth (1979) developed a procedure to
generate a pair of normally distributed
variables that had a specified sample
correlation value.

The following procedure gives a closed-form
solution to the problem of achieving a fixed
sample correlation between two real-valued
variables. The two variables do not have to
be normally distributed but may have
prespecified sample means and variances.

The problem: generate two variables, $x_1$ and
$x_2$, from samples of size n such that (1) the
means of $x_1$ and $x_2$ are $\bar{x}_1$ and $\bar{x}_2$,
respectively; (2) the standard deviations
are $s_1$ and $s_2$, respectively; and (3) the
correlation between $x_1$ and $x_2$ is $r_x$.

The solution: (1) generate two variables $z_1$
and $z_2$ and standardize their values using a
sample of size n; (2) calculate the correlation,
$r_3$, between $z_1$ and $z_2$; and (3) let

$$x_1 = z_1 s_1 + \bar{x}_1 \qquad (1)$$

and

$$x_2 = (z_1 c + z_2) s_2 + \bar{x}_2 \qquad (2)$$

where

$$c = \frac{-K_2 + (K_2^2 - 4 K_1 K_3)^{0.5}}{2 K_1}, \qquad (3)$$

$$K_1 = r_x^2 - 1, \qquad (4)$$

$$K_2 = 2 r_3 (r_x^2 - 1), \qquad (5)$$

and

$$K_3 = r_x^2 - r_3^2. \qquad (6)$$

The proof consists of finding the value of c
such that the correlation of $z_1$ and $(z_1 c + z_2)$
is $r_x$. The values of $z_1$ and $(z_1 c + z_2)$
are then adjusted to give the necessary
means and standard deviations. The proof
starts by expressing the square of the
correlation between $z_1$ and $(z_1 c + z_2)$ in
terms of sums of products and squares
(usually this is expressed in terms of
deviations from the mean but both $z_1$ and
$(z_1 c + z_2)$ have mean zero):

$$r_x^2 = \frac{\left[\sum z_1 (z_1 c + z_2)/(n-1)\right]^2}{\left[\sum z_1^2/(n-1)\right] \left[\sum (z_1 c + z_2)^2/(n-1)\right]} \qquad (7)$$


Multiplying the terms together in the
numerator and denominator of (7) and
recalling that the variance of $z_1$ is one
gives

$$r_x^2 = \frac{\left[\sum (c z_1^2 + z_1 z_2)/(n-1)\right]^2}{[1] \left[\sum (c^2 z_1^2 + 2 c z_1 z_2 + z_2^2)/(n-1)\right]} \qquad (8)$$

The following identities will be substituted
into (8):

$$r_3 = \sum z_1 z_2/(n-1), \qquad (9)$$

$$\sum z_1^2/(n-1) = 1, \qquad (10)$$

and

$$\sum z_2^2/(n-1) = 1, \qquad (11)$$

obtaining

$$r_x^2 = \frac{[c + r_3]^2}{[c^2 + 2 c r_3 + 1]}. \qquad (12)$$

Squaring the numerator of (12), multiplying
both sides by the denominator, and then
gathering all terms on the left hand side
obtains

$$(c^2 + 2 c r_3 + 1) r_x^2 - c^2 - 2 c r_3 - r_3^2 = 0. \qquad (13)$$

Rewriting as a quadratic function of c
results in

$$(r_x^2 - 1) c^2 + 2 r_3 (r_x^2 - 1) c + (r_x^2 - r_3^2) = 0. \qquad (14)$$

The solution to (14) can be found using the
quadratic formula for the following
quadratic equation

$$K_1 c^2 + K_2 c + K_3 = 0 \qquad (15)$$
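
The closed-form procedure can be sketched in Python as follows. This is
an illustration under stated assumptions, not the authors' program. The
function names are hypothetical. Two choices are added here and are not
taken from the paper: the root of the quadratic is selected so that
c + r_3 has the sign of the target correlation (the derivation fixes
only the squared correlation), and the linear combination is rescaled
to unit sample standard deviation before the target standard deviation
and mean are applied. The sketch requires |r_x| < 1.

```python
import math
import random


def standardize(values):
    """Return values shifted and scaled to sample mean 0 and sample sd 1."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / sd for v in values]


def sample_corr(a, b):
    """Sample correlation coefficient of two equal-length sequences."""
    za, zb = standardize(a), standardize(b)
    return sum(x * y for x, y in zip(za, zb)) / (len(a) - 1)


def correlated_pair(raw1, raw2, r_x, mean1, sd1, mean2, sd2):
    """Build x1, x2 with sample correlation r_x and the given moments."""
    z1, z2 = standardize(raw1), standardize(raw2)
    n = len(z1)
    r3 = sum(a * b for a, b in zip(z1, z2)) / (n - 1)
    # Coefficients of the quadratic K1 c^2 + K2 c + K3 = 0.
    k1 = r_x ** 2 - 1.0
    k2 = 2.0 * r3 * (r_x ** 2 - 1.0)
    k3 = r_x ** 2 - r3 ** 2
    root_term = math.sqrt(max(0.0, k2 * k2 - 4.0 * k1 * k3))
    roots = [(-k2 + root_term) / (2.0 * k1), (-k2 - root_term) / (2.0 * k1)]
    # Assumption: keep the root for which c + r3 has the sign of r_x.
    c = next(root for root in roots if (root + r3) * r_x >= 0.0)
    # Assumption: rescale the combination to unit sample sd before scaling.
    w = standardize([c * a + b for a, b in zip(z1, z2)])
    x1 = [mean1 + sd1 * a for a in z1]
    x2 = [mean2 + sd2 * b for b in w]
    return x1, x2


if __name__ == "__main__":
    random.seed(1)
    u = [random.gauss(0.0, 1.0) for _ in range(50)]
    v = [random.expovariate(1.0) for _ in range(50)]
    x1, x2 = correlated_pair(u, v, 0.6, 10.0, 2.0, -3.0, 0.5)
    print(sample_corr(x1, x2))
```

Because the sample correlation is unchanged by positive linear rescaling
of either variable, the final adjustment of means and standard
deviations does not disturb the correlation fixed by c.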
GAMMA PROCESSES, PAIRED COMPARISONS AND RANKING


Hal  Stern,  Harvard  University 


Introduction 

In non-parametric statistical procedures the n observations
in a sample are often replaced by their ranks within the
sample. Under the null hypothesis, the distribution on the
ranks is assumed to be uniform over permutations of the
integers from 1 to n. Other distributions on the permutations
are of interest as alternative descriptions. Mallows (1957)
introduces a variety of alternative distributions. In this
discussion we consider models for rank data which are derived
by considering permutations of gamma random variables. It
turns out that these models include the two most popular
ranking models. After some discussion motivating the use of
gamma random variables, the gamma models are applied to
paired comparisons experiments. In these experiments only
two of the set of objects are ranked at one time. This simple
case leads to some theoretical results about gamma models.
Finally, a data set consisting of the results of horse races
is analyzed using the methods described here. The gamma
models are used to model the observed distribution on
permutations.

Gamma  Comparison  Models 

Suppose that k individuals are to be ranked according to the
waiting time for r points to be scored. If the $i$th individual
scores points as a Poisson process with parameter $\lambda_i$
(the time between points is an exponential random variable
with mean $\lambda_i^{-1}$) then the time until r points are
scored has the gamma distribution with shape parameter r and
scale parameter $\lambda_i$. By also assuming that the waiting
times for the k individuals are independent we can compute the
probability that the k individuals are ranked in any order. Let
$\pi = (\pi_1, \ldots, \pi_k)$ be a permutation of the integers
from 1 through k and let $X_1, \ldots, X_k$ be independent gamma
random variables with shape parameter r and different scale
parameters $\lambda_1, \ldots, \lambda_k$. The probability that
individual $\pi_1$ is ranked first, $\pi_2$ is ranked second,
etc. is given by the k-dimensional integral

$$p^{(r)}(\pi) = \Pr(X_{\pi_1} < X_{\pi_2} < \cdots < X_{\pi_k}). \qquad (1)$$

This heuristic derivation is restricted to integer values
of r. Other values of r can also be considered if the point
scoring process is modeled as an independent increments
gamma process. The gamma process is a stochastic process
with parameter $\lambda$, $G_\lambda(r)$, such that
$G_\lambda(0) = 0$, $G_\lambda(r_2) - G_\lambda(r_1)$ is
independent of $G_\lambda(r_1) - G_\lambda(r_0)$ whenever
$r_0 < r_1 < r_2$, and $G_\lambda(r_2) - G_\lambda(r_1)$ has the
gamma distribution with shape parameter $r_2 - r_1$ and scale
parameter $\lambda$. Then the probability of a particular
permutation is defined for all positive values of r.

Paired  Comparisons 

In many experimental situations it is not reasonable to rank
more than two objects at a time. In ranking tennis or chess
players the only observations are the results of matches
between two players. From these results, we hope to rank all
of the players. This is an example of a paired comparison
experiment. A bibliography of the paired comparison
literature is provided by Davidson and Farquhar (1976).

Suppose that the k players or objects to be compared are
identified by the numbers 1, ..., k. The probability that i is
preferred to j in a comparison is denoted $p_{ij}^{(r)}$.
Conceptually $p_{ij}^{(r)}$ is the marginal probability that i
is ranked before j in the distribution $p^{(r)}(\pi)$. It can
also be derived by considering a comparison of two independent
gamma random variables with shape parameter r and differing
scale parameters. In this case

$$p_{ij}^{(r)} = \Pr(X_i < X_j) = \int_0^\infty \int_0^{x_j} \frac{\lambda_i^r \lambda_j^r \, x_i^{r-1} x_j^{r-1} e^{-\lambda_i x_i - \lambda_j x_j}}{\Gamma(r)\,\Gamma(r)} \, dx_i \, dx_j = f_r(\lambda_i/\lambda_j). \qquad (2)$$


For fixed r, the probability that i defeats j is increasing
in the ratio $\lambda_i/\lambda_j$. This is consistent with the
interpretation of $\lambda_i$ as the rate at which points are
scored by player i. It is also true that for fixed ratio
$\lambda_i/\lambda_j$ greater than one, the probability that i
defeats j is increasing in the parameter r which measures the
length of the game. More complicated models can be developed
to take into account covariate information or the possibility
of ties.
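
For integer r the preference probability can be evaluated without
numerical integration by using the Poisson point scoring
interpretation: each point scored by either player goes to i with
probability $\lambda_i/(\lambda_i + \lambda_j)$, and i is preferred if
it collects r points before j does. The sketch below sums over the
number of points j has scored when i scores its r-th point. The
function name `prefer_prob` is hypothetical, and the code assumes r is
a positive integer.

```python
from math import comb


def prefer_prob(rate_i, rate_j, r):
    """Pr(X_i < X_j) for gamma waiting times with integer shape r.

    rate_i and rate_j play the role of lambda_i and lambda_j.
    """
    p = rate_i / (rate_i + rate_j)  # chance that a given point goes to i
    q = 1.0 - p
    # j may score m = 0, ..., r-1 points before i scores its r-th point.
    return sum(comb(r - 1 + m, m) * p ** r * q ** m for m in range(r))


if __name__ == "__main__":
    for shape in (1, 2, 5, 20):
        print(shape, prefer_prob(2.0, 1.0, shape))
```

With r = 1 the sum has a single term and reduces to
$\lambda_i/(\lambda_i + \lambda_j)$. For a fixed ratio greater than one,
the printed values grow with r, in line with the monotonicity described
above.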


Examples 

The parameter r determines the shape of the gamma variables
to be compared. By considering specific values of r some
natural models are obtained. If r is equal to one then the
probability that i is preferred to j is
$\lambda_i/(\lambda_i + \lambda_j)$. This is the Bradley-Terry
paired comparison model (Bradley and Terry (1952), Bradley
(1953, 1954, 1955)). The Bradley-Terry model has a long
history of derivations and interpretations including Zermelo
(1929) and Ford (1957). Of the many alternative derivations it
is important to mention the convolution type linear model
approach discussed by David (1963), Latta (1979) and Bradley
(1953). If player i's score has the extreme value distribution
with location parameter $\ln \lambda_i$ and player j's score
has the extreme value distribution with location parameter
$\ln \lambda_j$ then the probability that i defeats j is given
by the Bradley-Terry model.

It turns out that for any value of r there is a convolution
type linear model which is equivalent to the gamma paired
comparison model (Stern 1987). For other values of r the
extreme value distribution is replaced by a different
translation family of densities.

Other integer values of r can be easily interpreted in terms
of the Poisson point scoring model. When r equals two, the
probability that i defeats j is the probability that player i
scores two points before player j does. This can be computed
directly from gamma random variables with shape parameter 2
or indirectly as a sequence of comparisons with r=1.
Noninteger values may also be considered. They are included
in this discussion by virtue of the independent increments
gamma process described earlier.

When large values of r are considered, the gamma paired
comparison model tends to the Thurstone-Mosteller model
(Thurstone 1927, Mosteller 1951). Thurstone (1927) assumes
that comparisons between two objects are determined by
comparisons of two normally distributed random variables.
Five different models are derived by making different
assumptions about the joint distribution of the normal random
variables. Mosteller (1951) discusses various properties of
Thurstone's model V in which the normal random variables are
assumed to have equal variances. The distribution of the
standardized gamma random variable with shape parameter r and
scale parameter $\lambda$ tends to a standard normal
distribution as r gets large. Thus comparisons between gamma
random variables lead to the Thurstone-Mosteller model for
large values of r. The gamma model is again found to be
equivalent to a convolution type linear model. More details
of the relationship between the Thurstone-Mosteller model and
the gamma model with large r are found in Stern (1987).

As a last special case, consider comparisons of gamma random
variables with shape parameter near zero. The distribution of
the logarithm of such a gamma random variable tends to the
exponential distribution. Thus a paired comparison between
two such gamma random variables is equivalent to a comparison
of two exponential random variables with different location
parameters.

Inferences  in  Paired  Comparisons 

Given the results of a series of comparisons involving k
objects, the statistical experimenter would like to predict
future comparisons or find the optimal ranking of the k
objects. In an experiment using gamma random variables with
shape parameter r and scale parameters
$\lambda_1, \ldots, \lambda_k$, estimates of the parameters are
required and goodness of fit tests can then be used to
determine whether the model is appropriate. The usual
formulation of the experiment treats the series of
comparisons as independent binomial trials. If r and
$\lambda_1, \ldots, \lambda_k$ are considered fixed then the
probability that object i is preferred to object j is
$p_{ij}^{(r)} = f_r(\lambda_i/\lambda_j)$. Let $n_{ij}$ be the
number of comparisons of objects i and j and let $a_{ij}$ be
the number of times that i is preferred to j. Then the
likelihood is

$$L = \prod_{i=1}^{k} \prod_{j>i} \binom{n_{ij}}{a_{ij}} \left(p_{ij}^{(r)}\right)^{a_{ij}} \left(1 - p_{ij}^{(r)}\right)^{n_{ij} - a_{ij}}. \qquad (3)$$


The shape parameter r may be considered fixed or treated as
a parameter to be estimated. In the former case r determines
the nature of the comparison and may be chosen before the
experiment is carried out. The maximum likelihood estimates
for $\lambda_1, \ldots, \lambda_k$ may be determined using
ordinary numerical algorithms when r is known. Ford (1957)
describes a procedure when r is equal to one. The asymptotic
normality of the maximum likelihood estimates follows from
the usual maximum likelihood theory (Lehmann 1983). The
calculation of estimates is more complicated when r is
treated as a parameter to be estimated. Typically several
values of r are considered and the r which achieves the
largest maximum value for the likelihood is the estimate.
Prior beliefs can be incorporated in a Bayesian analysis
(Davidson and Solomon (1973), Leonard (1977), Stern (1987)).
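
For the case r = 1 the likelihood equations can be solved by a simple
fixed-point iteration. The sketch below uses the classical update in
which each parameter is set to the object's total number of wins divided
by a sum of $n_{ij}/(\lambda_i + \lambda_j)$ terms, followed by
renormalization. It is offered as an illustration of "ordinary numerical
algorithms" and is not claimed to be the exact procedure of Ford (1957).
The function name and the win counts in the demonstration are
hypothetical, and the iteration assumes every object has at least one
win and one loss.

```python
def bradley_terry_mle(wins, max_iter=1000, tol=1e-12):
    """Maximum likelihood rates for the r = 1 paired comparison model.

    wins[i][j] is the number of times object i was preferred to object j.
    The returned rates are normalized to sum to one.
    """
    k = len(wins)
    lam = [1.0 / k] * k
    for _ in range(max_iter):
        updated = []
        for i in range(k):
            total_wins = sum(wins[i][j] for j in range(k) if j != i)
            denom = sum((wins[i][j] + wins[j][i]) / (lam[i] + lam[j])
                        for j in range(k) if j != i)
            updated.append(total_wins / denom)
        scale = sum(updated)
        updated = [value / scale for value in updated]
        change = max(abs(a - b) for a, b in zip(updated, lam))
        lam = updated
        if change < tol:
            break
    return lam


if __name__ == "__main__":
    # Hypothetical results for three teams, twelve games per pair.
    demo_wins = [[0, 7, 9],
                 [5, 0, 8],
                 [3, 4, 0]]
    print(bradley_terry_mle(demo_wins))
```

At convergence each object's observed number of wins equals its expected
number of wins under the fitted model, which is the form the likelihood
equations take when r = 1.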

In applications it is often desired to compare the fit of a
variety of models. Here we would like to compare the fit for
several values of the shape parameter r. The usual
log-likelihood ratio statistic is

$$Q = 2 \sum_{i=1}^{k} \sum_{j \ne i} a_{ij} \ln \left[ \frac{a_{ij}/n_{ij}}{f_r(\hat\lambda_i/\hat\lambda_j)} \right] \qquad (4)$$

where $\hat\lambda_1, \ldots, \hat\lambda_k$ are the maximum
likelihood estimates. For large samples Q has a chi-square
distribution with $(k-1)(k-2)/2$ degrees of freedom. An
alternative goodness of fit procedure is described by
Mosteller (1951).

Applications  to  Data 

In the 1986 National League baseball season each National
League baseball team played between eleven and eighteen games
against each of the other eleven teams. The results are
stored in a matrix whose (i,j) element is the number of times
that team i defeated team j. The gamma paired comparison
model was fit to this data (Stern 1987) with r = 0.1, 0.5, 1,
2, 3, 5, 10, 20. In each case the goodness of fit statistic
indicates an adequate fit. Also, the predicted results
$\hat{a}_{ij} = n_{ij} f_r(\hat\lambda_i/\hat\lambda_j)$ differ
by at most 0.1 over the range of r's considered. These results
are consistent with the results obtained from other baseball,
basketball and football seasons. Simulations also indicate
that models with different values of r lead to similar fits
unless the sample size is extremely large. This is consistent
with the observations of Latta (1979), Burke and Zinnes
(1965), and Jackson and Fleckenstein (1957). Each of these
authors found that the Bradley-Terry model (r=1) and the
Thurstone-Mosteller model (r large) lead to similar fits for a
given data set.

Why do all paired comparisons models lead to similar fits
for moderate sample sizes? The answer to this question can be
determined using the triples function or composition rule of
a model. The composition rule is the formula that is used to
determine $p_{ik}$, the probability that i beats k, from
$p_{ij}$ and $p_{jk}$. Each model has a different value
$p_{ik}$ for a particular pair of values of $p_{ij}$ and
$p_{jk}$. However it turns out that the different values are
quite close. Thus large numbers of comparisons of i and k are
required in order to distinguish between the models. Simple
calculations (Stern 1987) indicate that 500 or more
comparisons of each pair of teams are required.

Ranking 

Given the similarity of the gamma paired comparison models
for a wide range of r, it is natural to wonder if the models
corresponding to different values of r have different
properties in the general ranking problem. Recall that the
gamma model with parameter r for permutations of k objects is
obtained by taking the probability of the permutation $\pi$
to be equal to the probability that k independent gamma random
variables with shape parameter r and different scale
parameters are ranked according to the permutation $\pi$. The
probability of the permutation $\pi = (\pi_1, \ldots, \pi_k)$,
in which $\pi_i$ is the index of the object with rank i, is
denoted by $p^{(r)}(\pi)$. Marginals of this distribution are
represented by identifying the particular event whose
probability is being described. Thus we write $p^{(r)}(i)$ for
the probability that object i is ranked first ($\pi_1 = i$)
and $p^{(r)}(ij)$ for the probability that i is ranked first
and j is ranked second. We can take $\pi^{-1}_i$ to be the rank
of object i, that is to say $\pi^{-1}_i = j$ if and only if
$\pi_j = i$ or equivalently object i has rank j. Then
$p^{(r)}(\pi^{-1}_i < \pi^{-1}_j)$ is the probability that i is
ranked ahead of j.

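Again the case r equal to one corresponds to the most
commonly used ranking model. It represents the natural
generalization of the Bradley-Terry model to more than two
objects (Bradley, 1965). Let $\lambda_1, \ldots, \lambda_k$ be
the scale parameters associated with the k objects. Then

$$p^{(1)}(\pi) = \frac{\lambda_{\pi_1}}{\sum_{i=1}^{k} \lambda_{\pi_i}} \cdot \frac{\lambda_{\pi_2}}{\sum_{i=2}^{k} \lambda_{\pi_i}} \cdot \frac{\lambda_{\pi_3}}{\sum_{i=3}^{k} \lambda_{\pi_i}} \cdots \frac{\lambda_{\pi_{k-1}}}{\sum_{i=k-1}^{k} \lambda_{\pi_i}}. \qquad (5)$$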


This formula has a natural interpretation in terms of a
sequential ranking procedure. The probability that object
$\pi_1$ is ranked first according to the gamma model with
shape parameter one is equal to the probability that an
exponential random variable with parameter $\lambda_{\pi_1}$
(mean $\lambda_{\pi_1}^{-1}$) is the smallest of k exponentials
with parameters $\lambda_1, \ldots, \lambda_k$. This is
precisely the first factor of $p^{(1)}(\pi)$. The second
factor is the probability that object $\pi_2$ is ranked first
in a comparison of the remaining k-1 objects (those not ranked
first). The marginal probabilities are easy to specify due to
this property. For example

$$p^{(1)}(ij) = \frac{\lambda_i \lambda_j}{\sum_{m=1}^{k} \lambda_m \left( \sum_{m=1}^{k} \lambda_m - \lambda_i \right)}.$$

Henery (1981) discusses the derivation of (5) using
exponential random variables as we have described here.
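
The sequential interpretation of (5) translates directly into code: at
each step the next object is chosen with probability proportional to its
parameter among the objects not yet ranked. The sketch below is an
illustration only. The function names are hypothetical, objects are
indexed from zero, and the rates used in the demonstration are made up.

```python
from itertools import permutations


def sequential_prob(order, rates):
    """Probability of a full ranking under the shape-one gamma model.

    order lists object indices from first place to last place.
    """
    remaining = sum(rates[i] for i in order)
    prob = 1.0
    for i in order[:-1]:
        prob *= rates[i] / remaining  # i is first among those still unranked
        remaining -= rates[i]
    return prob


def first_two_prob(i, j, rates):
    """Marginal probability that i is ranked first and j second."""
    total = sum(rates)
    return rates[i] * rates[j] / (total * (total - rates[i]))


if __name__ == "__main__":
    demo_rates = [4.0, 3.0, 2.0, 1.0]  # hypothetical parameters
    total = sum(sequential_prob(p, demo_rates)
                for p in permutations(range(len(demo_rates))))
    print(total)
    print(first_two_prob(0, 1, demo_rates))
```

Summing `sequential_prob` over every ranking that starts with a given
pair gives the same value as `first_two_prob`, which is the sense in
which the marginal probabilities are easy to specify.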

This model also has the property that the conditional
probability of some events can be written in the same form as
$p^{(1)}(\pi)$. Suppose that we condition on the event that
object i is ranked first, then

$$p^{(1)}(\pi \mid \pi_1 = i) = \frac{p^{(1)}(\pi)}{p^{(1)}(i)} = \frac{\lambda_{\pi_2}}{\sum_{m=2}^{k} \lambda_{\pi_m}} \cdots \frac{\lambda_{\pi_{k-1}}}{\sum_{m=k-1}^{k} \lambda_{\pi_m}} \qquad (6)$$

is precisely the probability that the k-1 remaining objects
are ranked according to the permutation
$(\pi_2, \ldots, \pi_k)$. Harville (1973) proposed the formula
(5) based on this property.

Gamma models for distributions of permutations with shape
parameters other than one are difficult to work with because
there are no simple expressions for $p^{(r)}(\pi)$. The cases
in which r tends to $\infty$ and r tends to zero can be
analyzed by considering the equivalent translation family
models. Thus as r tends to infinity the gamma model resembles
the extension of the Thurstone-Mosteller model proposed by
Daniels (1950).

As a last example of gamma ranking models we consider integer
values of r greater than one, particularly r equal to two. The
probability of the permutation $\pi$ under the gamma model
with integer shape parameter r can be calculated using a
counting argument. To describe this argument suppose that
there are k players each attempting to score r points. Each
player scores points as a Poisson process and the k processes
are independent. Player i scores points at rate $\lambda_i$.
At first all k players are attempting to score points
simultaneously. This can be viewed as a combined Poisson
process with rate $\sum_{i=1}^{k} \lambda_i$. The probability
that a point in the combined process is scored by player i is
proportional to $\lambda_i$. Successive points are scored
independently due to the Poisson processes involved. When the
first player has accumulated r points, the corresponding
process is removed from the combined process. Now k-1 players
compete simultaneously. This counting argument leads to a
complicated expression for $p^{(r)}(\pi)$ (Stern 1987). A
similar expression is obtained by Henery (1983). As a
particular example if k=3 and r=2 then

$$p^{(2)}(\pi) = \lambda_{\pi_1}^{2} \lambda_{\pi_2}^{2} \, (\ldots) \qquad (7)$$

where $\lambda_{\pi_2 \pi_3} = \lambda_{\pi_2} + \lambda_{\pi_3}$
and without loss of generality we have set
$\lambda_{\pi_1} + \lambda_{\pi_2} + \lambda_{\pi_3} = 1$. The
justification for introducing this constraint appears in the
next section. These models no longer have the sequential or
conditional properties described previously for the model
with r equal to one. However, applications are given later in
which they lead to better fits.
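
The counting argument can be carried out mechanically by following the
combined point process one point at a time. The sketch below is one way
to do this and is not taken from Stern (1987). It tracks the score of
every player, assigns each new point to an active player with
probability proportional to that player's rate, and keeps only the paths
on which players reach r points in the requested order. The function
name is hypothetical, players are indexed from zero, and r is assumed to
be a positive integer. The cost grows quickly with k and r, so it is
meant for small examples.

```python
from functools import lru_cache
from itertools import permutations


def gamma_rank_prob(order, rates, r):
    """Probability that players reach r points in the given order."""
    k = len(rates)

    @lru_cache(maxsize=None)
    def go(points, finished):
        # points: current score of each player; finished: players done so far.
        if finished == k - 1:
            return 1.0  # the single remaining player must finish last
        active = [i for i in range(k) if points[i] < r]
        total_rate = sum(rates[i] for i in active)
        prob = 0.0
        for i in active:
            step = rates[i] / total_rate
            score = points[i] + 1
            if score == r:
                if i != order[finished]:
                    continue  # wrong player finished next; path excluded
                next_finished = finished + 1
            else:
                next_finished = finished
            new_points = points[:i] + (score,) + points[i + 1:]
            prob += step * go(new_points, next_finished)
        return prob

    return go((0,) * k, 0)


if __name__ == "__main__":
    demo_rates = (0.5, 0.3, 0.2)  # hypothetical rates summing to one
    for perm in permutations(range(3)):
        print(perm, gamma_rank_prob(perm, demo_rates, 2))
```

With r = 1 the recursion reproduces the sequential formula (5), and for
two players it reproduces the paired comparison probability for integer
r.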


Inference  in  Ranking  Models 

In a sample of n random permutations we denote the empirical
distribution by $p_n(\pi)$. The log likelihood under the gamma
model with shape parameter r is

$$\sum_{\pi} n \, p_n(\pi) \ln p^{(r)}(\pi) + C \qquad (8)$$

where C is a constant that does not depend on r or the
parameters $\lambda_1, \ldots, \lambda_k$. For large k it is
usually not feasible to record the entire empirical
distribution. Instead several marginals of the empirical
distribution are recorded. For example we may only know
$p_n(1), \ldots, p_n(k)$, the frequencies with which each
object is ranked first in the data set. Estimates of
$\lambda_1, \ldots, \lambda_k$ are computed using maximum
likelihood methods. Often the E-M algorithm (Dempster, Laird,
Rubin (1977)) can be used.

Maximum likelihood estimation is particularly straightforward
when the empirical distribution is known for k marginals which
are mutually exclusive and which exhaust the possible
permutations. The example cited above would be such a case. In
this case the likelihood equations reduce to a particularly
simple form. Estimates are obtained by solving the system of
equations obtained by setting each empirical marginal
probability equal to the marginal probability expressed in
terms of the parameters. We consider two examples which are
then used in an application.

In both examples, the empirical probabilities
$p_n(1), \ldots, p_n(k)$ are assumed known. For the gamma model
with shape parameter one the maximum likelihood estimates for
$\lambda_1, \ldots, \lambda_k$ are obtained by setting the
theoretical probability equal to the empirical probability

$$\lambda_i \Big/ \sum_{j=1}^{k} \lambda_j = p_n(i), \qquad i = 1, \ldots, k. \qquad (9)$$

It is easily seen that the $\lambda$'s are not uniquely
determined. This is true of the entire model; the scale
parameters can be multiplied by a positive constant and the
probabilities (1), (5)-(7) and the likelihood (8) are
unchanged. Typically the parameters are chosen to satisfy the
constraint $\sum_{i=1}^{k} \lambda_i = 1$. In this case the
maximum likelihood estimates are $\hat\lambda_i = p_n(i)$. If
the shape parameter is two, then the system of equations

$$\lambda_i^2 \Big( k! \prod_{j \ne i} \lambda_j + \cdots + 2! \sum_{j \ne i} \lambda_j + 1 \Big) = p_n(i), \qquad i = 1, \ldots, k \qquad (10)$$

must be solved. An iterative algorithm is required to solve
this system.

The Horse Race Problem

The problem in which the marginal probability of finishing
first is known for each object can be called the horse race
problem since this is approximately the case at the
racetrack. The argument supporting this statement is
described in more detail later. The horse race problem is
studied extensively by Ziemba and Hausch (1987). Either the
true probability that horse i wins or the empirical
probability that horse i wins is assumed to be known for each
of the k horses. Ziemba and Hausch then use the gamma model
with shape parameter one to estimate the probability that
each horse finishes second or third. They use these estimates
to compute the expected return on place and show bets. A place
bet on horse i is won if horse i finishes first or second. A
show bet is won if the horse finishes third or higher. Here we
compare the estimates provided by the model with r equal to
one and the model with r equal to two.

The results of horse races at Bay Meadows racetrack in
California during January and February 1987 were collected
from the newspaper. For each race the track odds and the
result of the race were recorded. The odds are determined by
the amount of money bet on each horse to win the race.
Approximately 20% of the wagered money is kept by the track in
the form of the track take and breakage (rounding off the
odds). The remaining money is split among those who bet on the
winning horse in proportion to the size of their bet. Suppose
that there are k horses and the odds for horse i are
$O_i : 1$. A $2 bet on horse i to win the race returns
$(2 O_i + 2) if horse i wins the race and nothing if horse i
loses. Let $p_1(i)$ be the proportion of the total number of
dollars bet on horses to win the race which is bet on horse i
to win the race. Without the track take, $p_1(i)$ and the odds
$O_i$ would be related by $p_1(i) = (O_i + 1)^{-1}$. Since the
reported odds are adjusted for the track take we take

$$p_1(i) = \frac{(O_i + 1)^{-1}}{\sum_{j=1}^{k} (O_j + 1)^{-1}}.$$

The collection $p_1(i)$, $i = 1, \ldots, k$ represents the
probability estimate of all the bettors considered together.
Ziemba and Hausch study the estimates $p_1(i)$ and find that
they provide good estimates of the probability that horse i
wins the race. They find that horses with high probability of
winning win more often than predicted by the bettors and
horses with low probability of winning win less often. People
are attracted by the potentially large payoff in the latter
case. Despite this inaccuracy, it is not possible to do much
better than $p_1(i)$ as an estimate of the probability that
horse i wins.

As in Harville (1973) and Ziemba and Hausch, we use $p_1(i)$
as the empirical probability although it is actually the
public's estimate of the probability that i wins. Then
$p_1(i)$ and the gamma model with shape parameter one are used
to estimate the probability that horse i finishes second.
From the previous section, the maximum likelihood estimates
are

$$\hat\lambda_i = p_1(i), \qquad i = 1, \ldots, k.$$


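Let $p^{(1)}(\cdot\, i)$ be the estimated probability that i
finishes second under this model. Then according to the
Harville formulas (the gamma model with shape parameter one),

$$p^{(1)}(\cdot\, i) = \sum_{j \ne i} \frac{\hat\lambda_j \hat\lambda_i}{1 - \hat\lambda_j} = \sum_{j \ne i} \frac{p_1(j) \, p_1(i)}{1 - p_1(j)}.$$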


As an alternative we fit the gamma model with shape parameter
two to the same data. We consider $p_1(i)$ fixed and use an
iterative procedure to solve the likelihood equations (10)
for the maximum likelihood estimates. The estimated
probability that i finishes second according to this gamma
model is computed by summing the estimated probability of all
permutations in which i finishes second. Let the estimates
from this gamma model be $p^{(2)}(\cdot\, i)$. The estimates
for both models computed for a sample race are:


Sample Horse Race Data
2nd Race, January 9, 1987

Horse   $O_i$   $p_1(i)$   $p^{(1)}(\cdot\, i)$   $p^{(2)}(\cdot\, i)$
1        8.9    0.084      0.107                  0.113
2       19.8    0.040      0.053                  0.059
3        1.1    0.395      0.280                  0.266
4        3.3    0.193      0.217                  0.214
5        6.2    0.134      0.162                  0.165
6        4.4    0.154      0.182                  0.183

Those  horses  with  large  probabilities  of  winning  have  a 
lower  probability  of  finishing  second  when  r  equals  two. 
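
The shape-one column of the sample race table can be recomputed from the
first-place probabilities alone. The sketch below applies the Harville
second-place sum to the tabulated $p_1(i)$ values. The function name
`harville_second` is hypothetical, and the input list is copied from the
table rather than recomputed from the odds.

```python
def harville_second(p_win):
    """Second-place probabilities under the shape-one (Harville) model.

    p_win[i] is the probability that horse i finishes first.
    """
    k = len(p_win)
    return [
        sum(p_win[j] * p_win[i] / (1.0 - p_win[j]) for j in range(k) if j != i)
        for i in range(k)
    ]


if __name__ == "__main__":
    # First-place probabilities p_1(i) for the six horses in the sample race.
    p1 = [0.084, 0.040, 0.395, 0.193, 0.134, 0.154]
    for horse, prob in enumerate(harville_second(p1), start=1):
        print(horse, round(prob, 3))
```

Each term in the sum is the probability that horse j wins multiplied by
the probability that horse i then beats the remaining field, which is
the sequential property of the shape-one model.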

Due to the complicated calculation required to compute
the estimates when the shape parameter is two,
we restrict attention to 47 races in which there were
6 horses. For each of the 47 races, estimates of the
probability that each horse finishes second are available
under two different models. The true probability that
each horse finishes second is unknown; only the identity
of the actual second place horse is available. The
number of times that a horse with a given estimated
probability of finishing second actually finished second
is compared to the expected number below. There are a
total of 282 horses in the 47 races. Horses with similar
estimated probabilities of finishing second are grouped
together. The expected number of second place finishes
for a particular group is computed as the sum of the
estimated probabilities for the horses in that group.
Expected and Observed Second Place Finishes

                   r = 1                  r = 2
p            Observed  Expected     Observed  Expected
0.00-0.10        6        4.15          4        3.75
0.10-0.15        7        7.69          9        7.94
0.15-0.20       13        8.56         13        9.97
0.20-0.25       12       12.54         13       14.10
>0.25            9       14.06          8       11.23

Note that the observed second place horse is put into
a group based on the estimated probability of its finishing
second. The same second place horse may be in
different groups under the two gamma models. In the
first  pair  of  columns  we  observe  results  similar  to  those 
of  Harville  (1973).  Horses  which  have  high  predicted 
probability  of  finishing  second  finish  second  less  often 
than  predicted.  Horses  with  low  estimated  probability 
of  finishing  second  do  better  than  expected.  By  taking 
r  equal  to  two,  the  fit  in  these  two  areas  is  improved.  It 
seems  that  higher  values  of  r  might  lead  to  better  fits 
but  this  has  not  been  verified. 
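The grouping used for the table of expected and observed finishes can be sketched as follows. This is only an illustration of the bookkeeping: the function name is hypothetical, the rule that a probability exactly on a boundary falls in the upper group is an assumption, and the outcome used in the example is invented because the text does not report which horse finished second in the sample race.

```python
def expected_and_observed(probs, finished_second, edges):
    """Return (observed, expected) second-place counts per probability group.

    edges holds the boundaries between groups, e.g. [0.10, 0.15, 0.20, 0.25]
    gives the five groups 0.00-0.10, 0.10-0.15, 0.15-0.20, 0.20-0.25, >0.25.
    """
    n_groups = len(edges) + 1
    observed = [0] * n_groups
    expected = [0.0] * n_groups
    for p, hit in zip(probs, finished_second):
        # The number of boundaries at or below p is the group index
        # (assumption: a value equal to a boundary goes to the upper group).
        g = sum(1 for e in edges if p >= e)
        expected[g] += p          # expected count = sum of estimated probabilities
        observed[g] += 1 if hit else 0
    return observed, expected


edges = [0.10, 0.15, 0.20, 0.25]
# Shape-one estimates for the six horses of the sample race.
probs = [0.107, 0.053, 0.280, 0.217, 0.162, 0.182]
# HYPOTHETICAL outcome, for illustration only: horse 4 finishes second.
finished_second = [False, False, False, True, False, False]

observed, expected = expected_and_observed(probs, finished_second, edges)
```

Applying the same function to all 282 horses in the 47 races, once per model, would give one pair of columns of the table.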

Summary 

Comparisons  of  gamma  random  variables  with  a 
given  shape  parameter  provide  a  family  of  distributions 
for permutations of k objects. When these models are
restricted to the paired comparison experiment they
provide alternative derivations of the currently used
models. These models are then naturally generalized
to the problem in which k objects are compared at a
time. There are indications that large sample sizes are
needed to distinguish between the models which correspond
to different shape parameters. However, there
are also indications that models with shape parameter
other than one can be successfully applied in practical
problems. In order to make the models more widely
applicable, improved methods for computing the distributions
are required. It would be particularly useful to
find approximations which can be easily computed.
A MODULAR NONPARAMETRIC APPROACH TO MODEL SELECTION
A two stage approach is introduced which allows a researcher to choose
from among ten alternative families of models for conditional
"slices" and individual component "sections," of a mixture of marginal
or conditional densities. The general concept of logmodel is
introduced and it is considered that the pair of models, normal-lognormal,
is only one of at least five classes of general logmodel
systems. A functional, referred to as R(x), is introduced which allows
one to determine model-class membership without the need to conduct
multiple trials with arbitrarily selected pretransformation location
parameters, e.g., the constant C of the transform Log(Y - C). The R(x)
functional is applied to the problem of parameter estimation for a
system of models which is the dual of the Johnson family of models.
Conditions for the existence of the above type of functional are
derived and an example of a model is given for which it is shown that
no functional exists which has the properties of R(x) constructed from
logmodel systems.
Lognormal; Logmodel; Loglogmodel; Conditional estimation; Mixture
decomposition;  Transformations;  Model-free  methods. 


0.  Introduction:  It  is  common  in  the 
early  stages  of  a  statistical 
investigation  to  construct  a  histogram 
or  scatter  diagram.  This  process  is 
usually  performed  for  the  purpose  of 
checking  the  assumptions  upon  which 
later  stages  of  the  data  analysis 
process will be based. Note the
important fact that when the
statistician plots a histogram, it is
usually the nature of a marginal and not
a conditional density that is being
checked. However, it is more often a
conditional density, a component of a
mixture of marginal densities or even a
mixture of conditional densities that is
the most appropriate target of the model
selection process.

Consider, for example, the simple case
of linear regression where the
observations are divided into two sets
of replicates, i.e.,

$E(Y_{ij}) = \alpha + \beta X_i, \quad i = 1,2; \; j = 1, \ldots, n_i,$

where $X_1$ is not equal to $X_2$ and $\beta$, $n_1$
and $n_2$ are nonzero. Even in the commonly
assumed situation where for a given
value of $X_1$ or $X_2$ the variate $Y_{ij}$ is
normally distributed, the marginal of
the population of all $Y_{ij}$ variates will
be a mixture of normals. Hence, in order
to check assumptions about $Y_{ij}$, the
researcher faces the task of
characterizing a multicomponent
distribution. In the above scenario the
researcher may be able to circumvent the
difficulty of dealing with a mixture of
distributions by examining the
conditional distributions of Y given $X_1$
and Y given $X_2$, for example by
constructing two histograms. However, in
many realistic data analysis situations
one will not initially be aware of the
existence of a variate whose values,
like $X_i = X_1$ and $X_i = X_2$, index
distributional identity. Even in the
situation where the existence of such a
variate is known, unless large numbers
of replicates are available at
particular values of this variate, the
investigator must rely on the analysis
of regression residuals to check
assumptions about the underlying
distribution. One trouble with this
approach is that, in order to analyze
the distribution of residuals about a
regression line, an investigator must
rely on some simple and usually linear
assumption about the functional E(Y|x).
Useful though it is in some situations,
residual analysis often leaves one
uncertain as to whether it is the model
for E(Y|x) itself, or the model for the
distribution of the residuals from
E(Y|x), or perhaps both models that are
suspect.
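The two-replicate-set example can be made concrete with a short numerical sketch. The parameter values below are hypothetical, and the common standard deviation for the two conditional normals is an added assumption; the point is only that the marginal density of Y is a weighted sum of two normal densities and can be bimodal even though each conditional density is normal.

```python
import math


def normal_pdf(y, mean, sd):
    """Density of a normal variate with the given mean and standard deviation."""
    return math.exp(-0.5 * ((y - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))


def marginal_pdf(y, alpha, beta, x_values, n_reps, sd):
    """Marginal density of Y when E(Y_ij) = alpha + beta * X_i.

    Each conditional density is assumed normal with common standard
    deviation sd; component weights are proportional to the replicate counts.
    """
    total = sum(n_reps)
    return sum((n / total) * normal_pdf(y, alpha + beta * x, sd)
               for x, n in zip(x_values, n_reps))


# Hypothetical values chosen only for illustration.
alpha, beta, sd = 1.0, 2.0, 0.5
x_values, n_reps = [0.0, 3.0], [20, 30]   # conditional means are 1.0 and 7.0

at_first_mean = marginal_pdf(1.0, alpha, beta, x_values, n_reps, sd)
at_midpoint = marginal_pdf(4.0, alpha, beta, x_values, n_reps, sd)
at_second_mean = marginal_pdf(7.0, alpha, beta, x_values, n_reps, sd)
```

With these values the marginal density is far lower midway between the two conditional means than at either mean, so a single histogram of all the Y values would show two modes.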

In addition to this difficulty, another
problem is sometimes encountered in
conventional regression analysis. As
illustrated in Tarter, Polissar and
Freeman (1988) the relationship between
two variates may not only be nonlinear,
but  a  single-valued  function  which 
adequately  describes  this  relationship 
may  not  even  exist.  In  view  of  the 
substantial  interest  in  mixture 
decomposition  and  cluster  analysis 
techniques,  it  seems  understandable  that 
any  study  based  solely  on  an  analysis  of 
the  residuals  from  a  single-valued  locus 
of  E(Y|x)  evaluations  may  be  overly 
simplistic. 

The  methodology  described  in  this  paper 
seeks  to  overcome  the  difficulties 
inherent  in  the  use  of  residual  analysis 
to  characterize  the  underlying 
distribution,  as  well  as  to  investigate 
the  possible  existence  of  unanticipated 
independent  variates.  A  systematic 
process  allowing  the  researcher  to 
select  an  appropriate  model  or 
transformation for his or her data is
proposed.

The  methods  detailed  in  sections  3  and  4 
are  all  two  step  processes.  The  first 
step  always  consists  of  estimating  the 
entire  joint  distribution  of  the 
observations  in  as  general  a  manner  as 
is  possible,  given  the  amount  of 
available  data.  Since  all  properties, 
curves  and  functionals  of  statistical 
interest  can  be  defined  in  terms  of  the 
joint  distribution  of  all  variates,  once 
the  underlying  distribution  function  is 
estimated,  one  can  proceed  to  the  second 

step  of  obtaining  what  can  be  referred 
to  as  "secondary"  estimators. 

The problem targeted by this paper is:
How can secondary estimators be used to
select from among a wide spectrum of
distributional models? In particular,
one would like the search for an answer
to this question to proceed in a
systematic fashion which takes advantage
of the interrelationships between
certain general classes of statistical
models.

1. Models and Logmodels: One basic
characteristic  differentiates  the 
proposed  model  selection  process  from 
previously  considered  procedures  such  as 
those  treated  extensively  by  Johnson 
(1949).  As  detailed  in  section  4,  the 
logarithmic  data  transformation  will  be 
a  common  element.  Thus,  the  new 
methodology  can  be  interpreted  as  being 
the  dual  of  the  Johnson  approach  in  the 
sense  that  one  commonly  used  data 
transformation  will  be  associated  with  a 
spectrum  of  models.  (The  Johnson 
approach  can  be  characterized  as  the 
consideration  of  a  variety  of 
transformations  to  yield  data  which 
conforms  to  a  single  model,  the  normal.) 
As suggested by Thompson (1988), the
Johnson  "family"  was  proposed  at  a  time 
when  the  normal  model  held  a  much  more 
prominent  position  in  the  pantheon  of 
underlying  distributions  than  it  does 


today.  Many  of  the  applied  scientists 
with  whom  the  authors  work  make 
extensive  use  of  the  logarithmic 
transformation. Thus, while we feel that
neither the logarithm nor the normal
distribution is in any sense sacred, we
do  feel  that  there  is  as  much 
justification  for  the  selection  of  a 
single  transformation  as  a  common 
element  as  there  is  for  selecting  a 
single  model. 

As  an  example  of  this  approach  consider, 
as  will  be  detailed  in  section  4,  that 
one  can  interpret  the  relationship 
between  the  standard  uniform  or 
rectangular  density  model  and  the 
exponential  model,  to  be  analogous  to 
the  relationship  between  the  lognormal 
and  the  normal  model.  The  negative 
logarithm  of  a  standard  uniform  variate 
will  yield  a  standard  exponential 
variate  in  exactly  the  same  way  that  the 
logarithm  of  a  lognormally  distributed 
variate  will  yield  a  normal  variate. 
Thus,  the  details  of  a  process  by  which 
one  can  conduct  a  search  for  the  most 
appropriate  model  will  be  given  in 
section  4.  This  process  takes  advantage 
of  the  fact  that  not  only  the 
rectangular  distribution,  but  also  the 
exponential  distribution,  can  be  treated 
as  a  special  case  of  the  general  power 
distribution  family  of  models. 
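The uniform-exponential pairing is easy to check by simulation. The following sketch draws standard uniform variates, takes negative logarithms, and compares the result with the standard exponential mean and cdf; the function names, sample size and seed are arbitrary choices made for this illustration.

```python
import math
import random


def neg_log_uniform(n, seed=0):
    """Return n values of -log(U), with U standard uniform."""
    rng = random.Random(seed)
    # 1 - rng.random() lies in (0, 1], which avoids taking log(0).
    return [-math.log(1.0 - rng.random()) for _ in range(n)]


def empirical_cdf(sample, t):
    """Fraction of the sample at or below t."""
    return sum(1 for v in sample if v <= t) / len(sample)


sample = neg_log_uniform(100_000)
sample_mean = sum(sample) / len(sample)
cdf_at_one = empirical_cdf(sample, 1.0)
```

For a standard exponential variate the mean is 1 and the cdf at t is 1 - exp(-t), so sample_mean should be close to 1 and cdf_at_one close to 1 - exp(-1).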

There  are  several  other  commonly 
encountered  pairs  of  models  which  have 
the  above  type  of  relationship.  The 
logarithm  of  a  Weibull  variate  minus  a 
constant  is  known  to  be  an  extreme  value 
variate  (Johnson  and  Kotz,  1970,  page 
272).  There  has  also  been  some 
consideration  of  what  could  be  called  a 
loglogistic  model  and,  as  will  be  shown 
below,  there  is  no  reason  why  one  cannot 
conceptualize  a  logcauchy  model. 

The  lognormal  is  obviously  by  far  the 
most  commonly  assumed  of  the  logmodels. 
It  is  also  very  often  used 
inappropriately  due  to  the  need  in  many 
situations,  to  estimate  an  appropriate 
constant  for  subtraction  from  the 
pretransformed  variate  as  a  preliminary 
to  calls  to  the  log  function.  It  will  be 
demonstrated  below  that  this  constant 
plays  an  important  role  in  other 
logmodels  and  thus,  for  the  purpose  of 
ease  of  reference,  this  constant  will  be 
called  a  "pretransformation  location 
parameter", symbolized by $\mu_1$. For
completeness, once the exact value of
the pretransformation location parameter
has been subtracted from the lognormally
distributed variate, the natural log of
the difference will be said to be
distributed with scale parameter $\sigma$ and
"posttransformation location parameter"
$\mu_2$.

A  common  graphical  procedure  which  is 
often used to check on the validity of
the lognormality assumption, is the
lognormal  plot.  Even  after  the  advent 
of  sophisticated  personal  computer 
graphical  packages,  one  still  might  find 
it  useful  to  hand-plot  an  estimated  cdf 
on graph paper whose abscissa is
graduated  by  a  log  scale  and  whose 
ordinate  is  graduated  on  a  standard 
normal  cumulative  scale.  An  example  of 
such  paper  is  given  in  Dixon  and  Massey 
(1983)  page  488. 

One  could  select  the  particular 
functional  associated  with  the  lognormal 
plot  as  what  was  previously  referred  to 
as  a  "secondary  estimator."  However, 
when  implemented  in  the  usual  way,  there 
is  an  important  weakness  to  this 
approach  which,  as  will  be  shown  below, 
is  not  associated  with  a  second,  and 
closely  related  functional. 

Consider that the lognormal plot
utilizes the human visual system's
sensitivity to the straightness or lack
of straightness of a line and that, as
is well known, in two dimensions, a line
is characterized by two numbers, for
example a slope $\alpha$ and a y-intercept
$\beta$. The lognormal model is characterized
by three numbers: a scale parameter $\sigma$ and
two location parameters, $\mu_1$ and $\mu_2$. For
the following reason the
posttransformation location parameter $\mu_2$
is much easier to estimate than the
pretransformation location parameter $\mu_1$.
In the case of lognormal data, once $\mu_1$
is determined, one can take the
logarithm of the underlying variate
minus this known value and be certain
that the resulting variate will have a
normal distribution (or in the case of
the four other distribution systems
considered below, an exponential,
extreme value, logistic or Cauchy
distribution). In other words, once
one's data have been transformed to
normality, the estimation of the
posttransformation location parameter $\mu_2$
has been reduced to the extremely well
studied problem of estimating the mean
of a normal variate. Hence, one finds
that a kind of Catch-22 applies to
lognormal data analysis. The normal
probability plot can be used to
ascertain the value of the easy-to-estimate
parameter $\mu_2$, but requires the
clumsy expedient of repeated trials with
a variety of choices for the value of $\mu_1$
in order to choose between alternative
values of the parameter. Clearly, it
would be preferable for a graphical
procedure to estimate the values of $\mu_1$
and $\sigma$, rather than the values of $\mu_2$ and
$\sigma$, and leave the problem of $\mu_2$
estimation to the solution of an easily
solved, rather than, as is the case of
$\mu_1$ estimation, the solution of a
difficult problem (Cohen 1951, Aitchison
and Brown 1957).

If, as an alternative to the functional
which underlies the lognormal plot, one
constructs the functional R(x) by
following these two steps: Step
1: substitute the unknown cumulative
y = F(x) into y (to detect membership
within the power-exponential family),
into -y log(y) (to detect membership
within the Weibull-extreme value
family), into y(1-y) (to detect
membership within the loglogistic
family), or into $(1+[\tan \pi(y-1/2)]^2)^{-1}/\pi$ (to
detect membership within the logcauchy
family); and Step 2: divide the above by
the unknown density f(x), where
f(x) = F'(x), one finds the following:

1) Whenever the true distribution is
exponential (or, for other special cases
of R(x), extreme-value, logistic or
Cauchy), the functional will be identical
to a horizontal line of height equal to
the value assumed by the scale parameter
of the model.

2) Whenever the true distribution is
power (or Weibull, loglogistic or
logcauchy), the functional will be a
diagonal line whose slope and y-intercept
are determined by the scale
parameter and pretransformation location
parameter $\mu_1$.

3) In the above cases, the R(x)
functional will be unrelated to the
value of the posttransformation location
parameter $\mu_2$.

4)  The  above  properties  apply  to  the 
normal  and  lognormal  models  for  the 
special  case  of  the  R(x)  functional 

described  in  section  3  of  this  paper, 
which,  unlike  the  other  four  examples 
described  in  this  paper,  cannot  be 
expressed  in  closed  form  in  terms  of 
elementary  functions  but  instead  must  be 
expressed  in  terms  of  the  normal  inverse 
cumulative. 
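The two construction steps and the properties just listed can be illustrated numerically. The sketch below is not the authors' estimator: it evaluates R(x) for a known cdf, approximating the density by a central difference, whereas in practice F and f would be replaced by nonparametric estimates. The dictionary of numerators follows the four closed-form substitutions named in the text; the entry for the power-exponential family uses y itself, which corresponds to a parent cdf of the form exp(z) for z <= 0 (an assumption about the sign convention). All function names and parameter values are hypothetical.

```python
import math

# Step 1: g(G^{-1}(y)) for four parent models, as functions of y = F(x).
NUMERATORS = {
    "exponential": lambda y: y,
    "extreme value": lambda y: -y * math.log(y),
    "logistic": lambda y: y * (1.0 - y),
    "cauchy": lambda y: 1.0 / (math.pi * (1.0 + math.tan(math.pi * (y - 0.5)) ** 2)),
}


def r_functional(cdf, x, family, h=1e-5):
    """Step 2: divide the Step 1 quantity by the density f(x) = F'(x)."""
    density = (cdf(x + h) - cdf(x - h)) / (2.0 * h)   # central difference
    return NUMERATORS[family](cdf(x)) / density


def logistic_cdf(x, loc, scale):
    """Parent (G-type) model: logistic cdf with location loc and scale."""
    return 1.0 / (1.0 + math.exp(-(x - loc) / scale))


def loglogistic_cdf(x, mu1, mu2, sigma):
    """Logmodel (F-type): G([log(x - mu1) - mu2] / sigma), G standard logistic."""
    return 1.0 / (1.0 + math.exp(-(math.log(x - mu1) - mu2) / sigma))


# Hypothetical parameter values.
flat = [r_functional(lambda t: logistic_cdf(t, 1.0, 1.3), x, "logistic")
        for x in (-1.0, 0.0, 4.0)]
sloped = [r_functional(lambda t: loglogistic_cdf(t, 2.0, 0.5, 0.7), x, "logistic")
          for x in (3.0, 5.0, 9.0)]
```

Here flat stays at the logistic scale 1.3 for every x, while sloped follows the line 0.7(x - 2), so its slope and intercept recover the scale and the pretransformation location parameter and do not involve the posttransformation location 0.5.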


2. The Logtransformation as C increases:
We will now show that in one important
sense, the role played by the
pretransformation location parameter
$\mu_1 = -C$ is in actuality the opposite of
what one might suspect it to be. It is a
large value for C rather than a small
value which yields a log transform which
minimally affects the pretransformed
variate. This observation can be
considered to be a corollary to the
following theorem:

A lognormal cumulative with
pretransformation location parameter -C,
posttransformation location parameter
$[\log C + \mu/C]$ and scale parameter
$\sigma/C$, approaches a normal cumulative
with location parameter $\mu$ and scale
parameter $\sigma$ as C approaches infinity.


Proof: Consider that a lognormal
cumulative F(x), as defined above, can
be expressed as $\Phi(v(x))$ where v(x) is
identical to
$\{[\log(x+C) - \log C]/C^{-1} - \mu\}/\sigma$ and $\Phi$
represents the standard normal
cumulative. By applying l'Hospital's
Rule twice, one finds that the limit of
$[\log(x+C) - \log C]/C^{-1}$ as C approaches
infinity equals x. Now consider the
following series expansion for $\Phi(x)$
given in Kendall and Stuart (1958) p.
136:

$$\Phi(x) = (1/2) + (2\pi)^{-1/2}\left(x - [x^3/6]\,[1 - (3/10)(x^2/2!) + \ldots]\right)$$

Each term of the bracketed series is
less than or equal in value to a
consecutive term of the power series
expansion of cos(x). By Abel's Theorem
the power series expansion of cos(x)
converges uniformly for any value of x
and thus, by the comparison test, the
above power series expansion of $\Phi(x)$ is
everywhere uniformly convergent.

Let $\Phi_n$ represent the n-term partial
sum of the above power series. Consider
the double limit

$$\lim \lim \Phi_n\left(\{\log(x+C) - [\log C + \mu/C]\}/(\sigma/C)\right),$$

expression (1), where the inner limit is
taken as C approaches $\infty$ and the outer
limit is taken as n approaches $\infty$.
Each individual $\Phi_n$ is a finite sum of
powers of the curly bracketed expression
and thus the inner limit equals
$\Phi_n((x-\mu)/\sigma)$ since the limit of
$C[\log(x+C) - \log C]$, as C approaches $\infty$,
was shown above to equal x. From Theorem
7.11 of Rudin (1953), since $\Phi_n$ approaches
$\Phi$ uniformly, the order of the two limits
of expression (1) can be reversed, which
proves the theorem.
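The limit at the heart of the proof can be observed numerically. In the sketch below the function names and the values of x, mu and sigma are hypothetical; math.log1p is used for the scaled transform because C[log(x + C) - log C] equals C log(1 + x/C), and that form is better conditioned for large C.

```python
import math


def scaled_log_transform(x, c):
    """C * [log(x + C) - log(C)], computed as C * log1p(x / C)."""
    return c * math.log1p(x / c)


def lognormal_argument(x, c, mu, sigma):
    """Argument of Phi for the lognormal cumulative of the theorem:
    pretransformation location -C, posttransformation location
    log(C) + mu/C, and scale sigma/C."""
    return (math.log(x + c) - (math.log(c) + mu / c)) / (sigma / c)


x, mu, sigma = 3.0, 1.0, 2.0   # hypothetical values
gaps = [abs(scaled_log_transform(x, c) - x) for c in (1e1, 1e3, 1e5)]
arg_large_c = lognormal_argument(x, 1e6, mu, sigma)
```

The entries of gaps shrink roughly like x^2/(2C), and arg_large_c is close to (x - mu)/sigma = 1, the argument of the limiting normal cumulative.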

Since, unlike the normal cumulative $\Phi$,
the other four logmodels considered in
this paper have cumulative distribution
functions which are closed forms of
elementary functions, one does not need
the elaborate argument presented above
to prove that the above feature of the
limit of the log transformation as C
approaches infinity applies to these
models.

3. Classes of Logmodels and the R(x)
functional: A. C. Cohen's (1951)
pioneering paper begins with the
sentence: "The logarithmic normal
distribution provides a useful
theoretical  model  for  studying  a  number 
of  biological  populations,  certain 
economic  populations  involving  income 
distributions,  and  others  in  which  the 
standard  deviation  of  individual 
observations  is  approximately 

proportional  to  the  magnitude  of  the 
observations."  The  last  part  of  this 
sentence  is  puzzling  since  it  is  hard  to 
see  how  an  individual  observation  can 
have  a  standard  deviation.  However,  it 
will be shown below that: there is a
property of the lognormal model which is
both linear and connected with the
standard deviation; there is exact, as
opposed to approximate, proportionality
involved; and, in the opinion of the
authors most importantly, this property
is shared with at least four other
important classes of distributional
models.

In the remainder of this paper the
following notation and assumptions will
be utilized:

1) The symbols $\phi$, $\Phi$ and $\Phi^{-1}$ will
represent the standard normal, i.e.
Gaussian, density, cumulative and
inverse cumulative respectively.

2) The symbols g, G and $G^{-1}$ will
represent respectively the standard
forms of the density, cumulative or
inverse cumulative of any one of the
following five distributions: positive
exponential, extreme value, logistic,
Cauchy or normal, defined over an
interval (a,b) where $G^{-1}G(x)$ is
identically equal to x for any x within
(a,b).

3) The symbols f, F and $F^{-1}$ will
represent respectively the density,
cumulative and inverse cumulative such
that for some G defined above, F(x) is
identically equal to
$G([\log(x-\mu_1)-\mu_2]/\sigma)$.

Tarter and Kowalski (1972) considered
the case where G is identical to $\Phi$. It
was shown that the functional R(x)
defined as $\phi\Phi^{-1}F(x)/f(x)$ has the
following three properties:

PROPERTY 1: $R(x) = \sigma$ for all finite x if
and only if F(x) is a normal cumulative
with standard deviation $\sigma$.

PROPERTY 2: $R(x) = \sigma(x-a)$ for any x > a
and zero elsewhere if and only if F(x)
is a three parameter lognormal
distribution function as defined above
for G identically equal to $\Phi$.

PROPERTY 3: $R(x) = \sigma(x-a)^2$ for all
finite x if and only if F(x) is a
three-parameter reciprocal normal distribution
function, i.e., if and only if
$F(x) = \Phi[((a-x)^{-1} - \mu)/\sigma]$.

PROPERTY 4: $R(x) = (x-\mu_1)(\mu_2-x)/(\mu_2-\mu_1)$
for $\mu_1 < x < \mu_2$ and zero elsewhere if and
only if F(x) is a distribution with
associated random variate X which can be
transformed to a normal variate Z by the
transformation $Z = \mathrm{Log}((X-\mu_1)/(\mu_2-X))$.
The remainder of this paper will
consider choices of the function $gG^{-1}$
within the definition of R(x), other
than the function $\phi\Phi^{-1}$. It will be shown
that properties one and two have several
useful extensions to a variety of
statistical models. However, before
turning to applications of $gG^{-1}$
alternatives, it seems useful to
reconsider the Cohen statement that the
"standard deviation of individual
observations is approximately
proportional to the magnitude of the
observations."

Notice that one can treat PROPERTY 1 as
the limiting case of PROPERTY 2. This
observation implies in turn that the
normal density can be treated as a
special case of the lognormal density.
(Note that a linear R(x) with an
extremely small but nonzero slope
characterizes a lognormal, while a
perfectly horizontal line characterizes
a normal model.)

There  is  a  concrete  way  to  view  the 
relationship  between  F-type  and  G-type 
models where $F(x) = G([\log(x-\mu_1)-\mu_2]/\sigma)$
and  G  represents  the  cdf  of  either  an 
exponential,  extreme  value,  logistic, 
Cauchy  or  normal  variate.  Consider  a 
technique  for  simulating  data  from  a 
specified  distribution  referred  to  in 
Tocher  (1963)  pages  22-24  as  the  "Method 
of  Mixtures."  The  method  is  based  in 
part  on  partitioning  the  support  region 
of  the  distribution  from  which  data  are 
to  be  simulated  into  class  intervals,  in 
a  manner  similar  to  the  procedure  by 
which  conventional  histograms  are 
constructed.

Let the symbol $x_g$ represent an
arbitrary left end-point of a class
interval and $x_g+\epsilon$ represent the right
end-point of this same interval, where
of course, $\epsilon > 0$. Furthermore, define
$A = G([\log(x_g-\mu_1)-\mu_2]/\sigma)$ and
$B = G([\log(x_g+\epsilon-\mu_1)-\mu_2]/\sigma)$ to be the
areas under the density from which data
are to be simulated, to the left of the
left and right class interval endpoints
respectively. Now suppose that, as one
would do by the method of mixtures, one
wishes to simulate data from one of the
above logmodels by using data generated
from the associated model. To assure
that the probability mass over the above
class interval equaled B-A, one could
solve the system of two equations,
$\sigma_g G^{-1}(A) + \mu_g = x_g$ and
$\sigma_g G^{-1}(B) + \mu_g = x_g + \epsilon$,
for the constants $\mu_g$ and $\sigma_g$ and
find that

$$\sigma_g = \epsilon/[(\log(x_g+\epsilon-\mu_1)-\log(x_g-\mu_1))/\sigma].$$

If one applies l'Hospital's rule one
finds that as the length of the
interval, $\epsilon$, approaches zero, the scale
parameter $\sigma_g$ approaches $(x_g-\mu_1)\sigma$, i.e.,
one can use the method of mixtures to
simulate data from any logmodel by using
data generated from the associated
parent model and linearly varying the
scale parameter of the parent model
data.
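The limiting value of the local scale parameter can be verified directly from the expression for $\sigma_g$. The sketch below uses hypothetical parameter values and a hypothetical function name; the difference of logarithms is written with math.log1p, which is algebraically the same quantity but numerically stable for very short intervals.

```python
import math


def local_scale(x_g, eps, mu1, sigma):
    """sigma_g = eps / ([log(x_g + eps - mu1) - log(x_g - mu1)] / sigma).

    log(x_g + eps - mu1) - log(x_g - mu1) is evaluated as
    log1p(eps / (x_g - mu1)).
    """
    return eps * sigma / math.log1p(eps / (x_g - mu1))


mu1, sigma = 2.0, 0.7   # hypothetical parameter values
left_endpoints = (3.0, 5.0, 9.0)
scales = [local_scale(x_g, 1e-8, mu1, sigma) for x_g in left_endpoints]
targets = [(x_g - mu1) * sigma for x_g in left_endpoints]
```

For a very short interval each entry of scales agrees with the corresponding entry of targets, which is the linear growth of the parent-model scale with $x_g$ described in the text.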


An alternative view which can be used
to visualize the above relationship is
as follows: For a small positive value
of $\epsilon$, consider $R(x) = gG^{-1}F(x)/f(x)$ for
x within the interval $[x_g, x_g+\epsilon]$. Since,
as defined above, F is the cdf of either
a power, Weibull, loglogistic, logcauchy
or lognormal variate with scale
parameter $\sigma$, R(x) will be a line with
slope equal to $\sigma$. Consider the line
segment that connects the point
$(x_g, R(x_g))$ to the point $(x_g+\epsilon, R(x_g+\epsilon))$.
Suppose that this line segment is
rotated about the point
$(x_g+\epsilon/2, R(x_g+\epsilon/2))$ in order to form a
horizontal line segment whose height is
identically equal to $R(x_g+\epsilon/2)$. By the
uniqueness of the general solution of
the first order differential equation
which defines R(x), the only analytic
cdf which can determine such a
horizontal R(x) within the small
interval must be identical to
$F(x) = G([x-\mu_3]/\sigma_g)$ for some value of
$\mu_3$ and for $\sigma_g$ identically equal to
$(x_g+\epsilon/2-\mu_1)$ times $\sigma$. In other words, the
scale parameter $\sigma_g$ is a linear function
of $x_g$.

The  above  explains  the  previously 
referred  to  characteristic  of  the 
lognormal  model  suggested  by  Cohen,  and 
shows  that  this  characteristic  is  shared 
by  all  logmodels,  each  of  which  can  be 
thought  of  as  a  composition  of  "parent" 
model  evaluations  where  the  scale 
parameter  of  the  parent  model  increases 
linearly. In essence, the R(x)
functional is this linear relationship.
The term "parent" model corresponds to
some G defined above, while the term
"composition" refers to the limit, as $\epsilon$
approaches zero, of some function
which in any interval $[x_g, x_g+\epsilon]$ is
identical to $G([x-\mu_3]/\sigma_g)$.

To  clarify  Cohen's  statement,  it  is  not 
the  data  points  themselves  which  have  an 
increasing  scale  parameter,  but  the 
pieces  of  the  above-defined  density 
mixture.  Furthermore,  all  five  of  the 
logmodels  considered  in  this  paper,  and 
not  simply  the  lognormal,  have  this 
relationship. One can of course
generalize Cohen's statement for the
case of any cumulative
$G([v(x_g-\mu_1)-\mu_2]/\sigma)$ where v is a monotonic function
with a first derivative v' for which
$\sigma_g = \sigma[v'(x_g-\mu_1)]^{-1}$. It is the fact that
the reciprocal of the derivative of the
function v(x) = log(x) equals x, that
underlies the convenience of the R(x)
graphical approach.

4. Examples of logmodels: In order to
demonstrate the generality of the
modular estimation of R(x), it seems
appropriate to detail the application of
this functional to the goodness-of-fit
of a sequence of model families more
general than the five models described
in section 1. Suppose $\phi$, $\Phi$, G, g, F, and
f are all defined as in the previous
section except that

$$F(x) = H[(\mathrm{Log}\{\mathrm{Log}(x-\mu_1)-\mu_2\}-\mu_3)/\sigma] = H[\mathrm{Log}(\mathrm{Log}(x-\mu_1)-\mu_2)^{1/\sigma} - \mu_3/\sigma]$$

where $G(x) = H[\mathrm{Log}\, x^{1/\sigma}] = H[(1/\sigma)\mathrm{Log}(x)]$
and H, rather than G, is a standard
exponential, extreme value, logistic,
Cauchy or normal cdf. For these
"loglogmodels," the $gG^{-1}$ component of
the R(x) functional is identical to
$hH^{-1}(y)\exp[H^{-1}(y)]$. Hence, not only can
one apply the R(x) functional to the
problem of graphically selecting between
five commonly used logmodels f, but one
can use an approach similar to the
conventional application of the
lognormal probability plot to estimate $\sigma$
and the three additional parameters $\mu_j$,
j = 1,2,3 of any loglogmodel. Instead of
selecting a trial value for the
pretransformation location parameter $\mu_1$
and checking the straightness of the
resulting lognormal probability plot,
one can:

1) select a trial value for the
scale parameter $\sigma$;

2) check by the straightness of the R(x)
functional on the accuracy of this
choice for $\sigma$;

3) when a straight R(x) is obtained, one
can estimate the parameters $\mu_1$ and $\mu_2$
in terms of the estimated slope and
y-intercept of R(x);

4) in the univariate case, once $\mu_1$ is
estimated, one can transform each data
point $X_i$ to $\mathrm{Log}[X_i - \hat{\mu}_1]$ where $\hat{\mu}_1$
represents the graphically obtained
estimator of $\mu_1$; and finally

5) now that the problem of estimating
the parameters of a loglogmodel has been
reduced to the problem of estimating the
parameters of a logmodel, one can use
the R(x) functional again to check, by
the straightness of R(x), on the
validity of the choice of the underlying
h as one of the five model families
considered above.

It  is  appropriate  to  point  out  that  by 
extending  the  derivation  detailed  in 
section  3,  one  can  show  that  each 
logmodel  is  a  special  limiting  case  of  a 
loglogmodel.  Consider  the  Pearson  family 
and  most  other  methods  used  to 
generalize the normal model. In the case
of  the  Pearson  family,  it  is  the 
coefficients  of  terms  in  a  quadratic 
denominator  whose  common  zero  value 
reduces  the  general  model  to  the  special 
normal  case  (Elderton,  1953,  p.49).  In 

both  the  logmodel  and  loglogmodel 
systems,  it  is  the  value  infinity  which 
"reduces"  the  general  model  to  its 
special  case.  However,  unlike  the 

Pearson family and any other generalized
model family of which we are now
aware, only the loglogmodel system has
both  the  normal  and  the  lognormal  as 
special  cases. 

It  is  also  of  some  interest  to  compare 
the  approach  to  model  generality 

considered  in  this  section  to  the 

Johnson Family of Distributions (Johnson,
1949; Tapia and Thompson, 1978, pp. 30-33).
Consider that there are two basic
assumptions involved in use of the
lognormal model. It is assumed that (1)
a logarithmic transformation will (2)
transform one's data to normality. In
essence,  the  Johnson  Family  approach 
generalizes  assumption  (1)  while  the 
methodology  considered  here  generalizes 
assumption  (2).  In  the  next  section  of 
this  paper  the  question  of  the 
generality  of  the  R(x)  function  itself 
will  be  considered  and  it  will  be 
asserted  that  the  connection  between 
this  type  of  graphical  method  and  the 
logarithmic  transformation  is  not 
necessarily  shared  by  alternative 
distribution  systems. 


5. Conditions for the Existence of an
R(x)-type Graphical Method: In order to
consider the general properties of the
previously described modular methods, it
is useful to represent the functional R
by $R[f(x;\theta_1,\theta_2), F(x;\theta_1,\theta_2)]$ where f and
F represent the hypothesized pdf and cdf
of the underlying random variate which
are specified up to the values of the
two unknown parameters $\theta_1$ and $\theta_2$. The
following  question  then  arises:  For  what 
classes  of  statistical  models,  will 
there  exist  a  differentiable  function  R 
with  the  following  two  properties: 

1. For a fixed value of $\theta_2$, R will be
identically equal to a constant for all
values of the random variate X = x.

2. For a fixed value of $\theta_2$, the value of
R will not change, i.e., R will assume a
constant value, for any value assumed by
the parameter $\theta_1$.

These are two of the
properties which make the R(x)
functional particularly useful. For
example, in the case of the normal model
$\phi[(x-\mu)/\sigma]/\sigma$, $\theta_1 = \mu$ and $\theta_2 = \sigma$, the
value assumed by R is equal to $\sigma$ for any
value of x, i.e., Property 1, and R is
functionally unrelated to $\mu$, i.e.,
Property 2.

If  one  restricts  R  and  F  to  the  class  of 
differentiable  functions,  it  is  easy  to 
obtain  necessary  conditions  for  R  to 
have  Properties  1  and  2.  Let  u  and  v 
represent  the  two  arguments  of  the 
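function R. By the chain rule,
Properties 1 and 2 imply that

$$\frac{\partial R}{\partial u}\,\frac{\partial f(x;\theta_1,\theta_2)}{\partial x} + \frac{\partial R}{\partial v}\,f(x;\theta_1,\theta_2) = 0$$

and

$$\frac{\partial R}{\partial u}\,\frac{\partial f(x;\theta_1,\theta_2)}{\partial \theta_1} + \frac{\partial R}{\partial v}\,\frac{\partial F(x;\theta_1,\theta_2)}{\partial \theta_1} = 0.$$
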
The  truncated  exponential  model  provides 
an  example  of  a  parametric  family  which 
does  not  satisfy  the  above  necessary 
conditions.  Specifically,  suppose 

$f(x) = (b/\sigma)\exp(-x/\sigma)\,I_{[0,B]}(x)$, where
$I_{[0,B]}$ represents the indicator function
of the closed interval [0,B], and b is
chosen to assure that the definite
integral of f over [0,B] equals one. For
this choice of f, the above necessary
conditions imply that

$$\frac{\partial R}{\partial u}\left(-\frac{b}{\sigma^2}\right) + \frac{\partial R}{\partial v}\left(\frac{b}{\sigma}\right) = 0$$

and

$$\frac{\partial R}{\partial u}\,\frac{1}{\sigma}\exp(-x/\sigma) + \frac{\partial R}{\partial v}\,[1-\exp(-x/\sigma)] = 0,$$

which in turn implies that
$-\exp(-x/\sigma)/[1-\exp(-x/\sigma)]$ identically
equals one. Hence, there cannot exist a
differentiable function R which has
Properties 1 and 2 for the special case
of  the  truncated  exponential  model.  On 
the other hand, for any functions g, G
and $G^{-1}$ defined in section 3 and h, H
and $H^{-1}$ where $h(x) = g([x-\mu]/\sigma)/\sigma$,
$H(x) = G([x-\mu]/\sigma)$ and $H^{-1}(y) = \mu + \sigma G^{-1}(y)$,
the necessary conditions imply that
$(\partial R/\partial u)/(\partial R/\partial v)$ equals $-h(x)/h'(x)$,
which is satisfied by $R(u,v) = gG^{-1}(v)/u$.
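As a numerical illustration of the last statement, the following minimal sketch evaluates $R(u,v) = g(G^{-1}(v))/u$ for a normal location-scale model, taking u = h(x) and v = H(x). The function name `r_functional` and the particular values of the location and scale are assumptions made for the example only; the standard normal is used for g and G because the Python standard library supplies its pdf, cdf and inverse cdf.

```python
from statistics import NormalDist

def r_functional(u, v, std=NormalDist()):
    """R(u, v) = g(G^{-1}(v)) / u, with g and G the standard normal pdf and cdf."""
    return std.pdf(std.inv_cdf(v)) / u

# Hypothetical location and scale, chosen only for illustration.
mu, sigma = 3.0, 2.0
model = NormalDist(mu, sigma)          # h = model.pdf, H = model.cdf

# Evaluate R along a grid of x values within two scale units of mu.
xs = [mu + sigma * k / 4 for k in range(-8, 9)]
r_values = [r_functional(model.pdf(x), model.cdf(x)) for x in xs]

# Property 1: R is constant in x (and equal to sigma).
# Property 2: changing mu leaves R unchanged.
shifted = NormalDist(mu + 10.0, sigma)
r_shifted = [r_functional(shifted.pdf(x + 10.0), shifted.cdf(x + 10.0)) for x in xs]
```

Under these assumptions every entry of `r_values` and `r_shifted` agrees with `sigma` up to rounding error, which is the behavior Properties 1 and 2 describe.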
6. A Description of the Modular Model
Selection, Slicing and Sectioning
Program: The principal goal set during
the construction of this program was
that the model selection process was to
be based on an estimated slice,
i.e., conditional; an estimated section,
i.e., separated distributional
subcomponent; or (and in the authors'
opinion, the least applicable case) an
estimated marginal. Thus, the program differs
radically from previous goodness-of-fit
subroutines, such as that included
within STATGRAF (1988), which can only be
used to check on model appropriateness
for univariate data.

The  main  menu  of  the  program  allows  a 
user  to  select  a  file  of  univariate  or 
bivariate  data  and  estimate  a  marginal 
of  either  of  two  variates  or 
alternatively  estimate  a  conditional 
slice  of  one  variate  given  any  selected 
value  of  the  second  variate.  The  first 
call  to  the  conditional  slice  subroutine 
initiates  the  computation  of  those 
bivariate  sample  Fourier  coefficients 
(trigonometric  moments)  found  to  be 
appropriate  for  the  estimated 
distribution  and  available  sample  size. 
(Tarter  and  Kronmal  1970,  describes  the 
theory  which  underlies  this  procedure.) 
For  samples  larger  than  one  thousand 
from  multicomponent  mixtures,  the 
execution  of  this  one  step  can  take  as 

long  as  three  minutes  on  an  IBM  AT 
compatible  PC  with  a  6  MHz  clock. 
However,  since  for  any  given  set  of 
data,  all  subsequent  slicing,  sectioning 
and  model  selection  procedures  are  based 
on  this  same  set  of  estimated  Fourier 
coefficients, all further calculations
can be performed with response delays
of only two to fifteen seconds on
the 6 MHz AT-type PC.
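The compute-once, reuse-often structure described above can be sketched as follows. This is a simplified univariate illustration, not the authors' GKS-FORTRAN program: it assumes data already scaled to [0, 1], an unweighted series, and a fixed truncation point m, whereas the program described here works with bivariate coefficients and chooses the terms from the data. The function names are hypothetical.

```python
import cmath
import math
import random

def fourier_coefficients(data, m):
    """Sample trigonometric moments c_k = (1/n) * sum_j exp(-2*pi*i*k*x_j), k = -m..m."""
    n = len(data)
    return {
        k: sum(cmath.exp(-2j * math.pi * k * x) for x in data) / n
        for k in range(-m, m + 1)
    }

def density_estimate(coeffs, x):
    """Unweighted truncated Fourier series density estimate at a point x in [0, 1]."""
    total = sum(c * cmath.exp(2j * math.pi * k * x) for k, c in coeffs.items())
    return total.real

# Illustrative data only: a seeded pseudo-random sample clipped to [0, 1].
random.seed(0)
sample = [min(max(random.gauss(0.5, 0.1), 0.0), 1.0) for _ in range(500)]

# The single pass over the data (the expensive step).
coeffs = fourier_coefficients(sample, m=5)

# Every later evaluation reuses the stored coefficients and never touches the data.
grid = [j / 200 for j in range(200)]
curve = [density_estimate(coeffs, x) for x in grid]
```

Because later curves depend on the data only through `coeffs`, their cost grows with the number of retained terms rather than with the sample size, which is consistent with the short response times reported for the follow-up steps.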

One  of  the  main  menu  options  is  labeled 
CONTRAST.  This  option  allows  a  user  to 
enter  a  value  of  a  constant  which 
separates  individual  estimated 
distributional  components.  (Details  of 
this  procedure,  but  not  the  model 
selection  process,  are  described  in 
Tarter  1979,  section  3.)  If  the  CONTRAST 
option  is  selected  immediately  after  a 
conditional  slice  has  been  estimated,  a 
user  can  section  a  conditional  in 
exactly  the  same  way  that  he  or  she  can 
section  a  marginal,  if  the  CONTRAST 
option  is  executed  immediately  after  a 
marginal  is  estimated. 

The  two  options  listed  in  the  program's 
main  menu  which  are  most  germane  to  this 
paper  are  labeled  R(X)  and  FIT  MODEL. 
Each  of  the  five  model-logmodel  families 
considered  above  is  keyed  to  a 
particular  color.  By  using  a  combination 
of R(X) and MODEL SELECTION options, a
user  can  graph  any  combination  of 
estimated  R(x)  functionals  or  fitted 


models  or  logmodels.  The  choice  of  color 
is  used  to  identify  the  model  upon  which 
an  estimated  R(x)  and  fitted  density  are 
based.  A  dotted  line  is  used  to  graph  an 
estimated  logmodel  while  a  dashed  line 
of  the  same  color  is  used  to  identify 
the  associated  model. 

One  major  practical  problem  was 
encountered  when  the  primary  bivariate 
estimation  subroutines  were  combined 
with  the  sectioning,  slicing  and  R(X) 
model  identification  subroutines.  The 
standard  forms  used  to  represent  each  of 
the  five  model  systems  considered  in 
section  4  in  no  way  took  into  account 
the  problem  of  comparing  the  fit  of 
these  models.  One  of  the  authors  of  this 
paper  had  encountered  this  difficulty 
twice  in  the  past.  In  Kronmal  and 
Tarter  (1968)  it  was  found  that  in  order 
to  compare  the  Mean  Integrated  Squared 
Error  characteristics  of  nonparametric 
estimates  obtained  from  normal  and 
Cauchy  data,  one  needed  to  introduce  a 
variant  of  the  Cauchy  "standard  model" 
that  was  comparable  to  the  standard 
normal, $\Phi$, in the sense that the two
standard-form models had the same first,
second and third quartiles. In Tarter
(1968), the optimal scale parameter
coefficient of the standard logit,
log(y/(1 - y)), was found which provides
the  best  fit  of  this  model  to  the 
inverse  standard  normal  cumulative  based 
on  the  integrated  squared  error  metric. 
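The quartile-matching idea can be illustrated with a short calculation. The sketch below is one way to satisfy the stated condition and is not taken from Kronmal and Tarter (1968): a Cauchy distribution centered at zero with scale c has quartiles -c, 0 and c, so matching the standard normal quartiles amounts to setting c equal to the upper quartile of $\Phi$. The names `cauchy_cdf` and `matched_scale` are assumptions made for the example.

```python
import math
from statistics import NormalDist

def cauchy_cdf(x, scale):
    """Cdf of a Cauchy distribution centered at zero with the given scale."""
    return 0.5 + math.atan(x / scale) / math.pi

# Upper quartile of the standard normal.
q3 = NormalDist().inv_cdf(0.75)

# A zero-centered Cauchy has quartiles -scale, 0, +scale, so this scale
# gives it the same first, second and third quartiles as the standard normal.
matched_scale = q3

quartile_probs = [cauchy_cdf(x, matched_scale) for x in (-q3, 0.0, q3)]
```

With this choice the three entries of `quartile_probs` are 0.25, 0.5 and 0.75 up to rounding, and `matched_scale` is approximately 0.6745.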

It  might  also  be  pointed  out  that  the 
constant two, chosen as a divisor of $-x^2$
within the "standard" normal exponent
$-x^2/2$, serves the sole purpose of
assuring that the scale parameter $\sigma$ of
the nonstandard normal is identical to
the  property  of  the  distribution  usually 
referred  to  as  the  "standard"  deviation. 
In  fields  such  as  optics,  where  the 
normal  is  used  without  need  for  the 
convenience  of  scale  parameter  and 
standard  deviation  identity,  the 
standard  distribution  is  defined  without 
the  constant  two. 

Since  the  property,  standard  deviation, 
does  not  exist  for  the  Cauchy  model,  the 
standard  form  of  this  distribution  is 
usually  chosen  for  reasons  of  simplicity 
and  unlike  the  normal,  no  attempt  is 
made  to  identify  the  scale  parameter 
with  a  property  of  the  distribution. 
This  means  that  there  is  no  particular 
reason  to  expect  that  the  standard  form 
of  the  Cauchy  will  be  in  any  way 
comparable  to  the  standard  form  of  the 
normal.

The logistic inverse cumulative, i.e.,
log(y/[1 - y]), is so simple a function
that no attempt is usually made to see
that the scale parameter of this model
is identical to any distributional
property or comparable to the standard
normal inverse cumulative. Only the
exponential model seems to share with
the normal the distinction that its
standard form allows a parameter of the
nonstandard form to be associated with
an easily interpreted property.
Specifically, if $f(t) = T\exp(-Tt)\,I_{[0,\infty)}(t)$,
then T = 1/E(t) (Gross and Clark, 1975, p. 52).

One  can  use  the  fact  that  families  of 
logmodels  have  the  above-mentioned 
arbitrariness  to  one's  advantage,  by 
comparing  two  alternative  R(x) 
functionals  which  are  graphed  so  that 
they  assume  approximately  the  same  value 
at  the  estimated  mode  of  one  of  the 
densities. In the case where a model is
correctly selected, Property 2 of
section 3 implies that the true R(x)
functional  will  be  a  line  with  slope 
identically  equal  to  the  scale  parameter 
of  the  model.  It  should  also  be 
mentioned  that  the  above  dependence  of  a 
graphical  method  on  the  definition  of 
the  "standard  model"  is  shared  by 
conventional  graphical  methods  which 
rely  on  a  plot  of  estimated  cumulative 
probability  on  a  y-axis  which  has  been 
transformed  to  "standard"  normal  scale, 
against  a  transformed  scale,  e.g.,  log 
or  square  root,  on  the  x-axis,  as 
illustrated  in  Dixon  and  Massey  (1983) 
page  488.  If  one  were  to  prepare  Cauchy 
or  logistic  probability  paper  to  compare 
normal  to  Cauchy  or  logistic  fit,  one 
would  find  that  the  slope  of  the  plot 
obtained  under  the  null  hypothesis  of 
true  fit,  would  be  dependent  upon  the 
definition  of  scale  parameter  of  the 
nonstandard  model,  i.e.,  F. 

The  R(x)  functional  has  a  slope  equal  to 
zero  in  the  null  hypothesis  case  for  any 
of  the  five  parent  models  considered  in 
this  paper.  Hence,  besides  being  far 
less  dependent  on  choice  of  standard 
model  than  is  the  conventional 
lognormal-type  plot,  it  also  tends  to 
circumvent  a  subtle  problem  associated 
with the Kolmogorov-Smirnov, K-S, or
Chi-square  goodness  of  fit  procedures 
included  in  such  packages  as  STATGRAF 
(1986).  To  use  these  two  procedures,  the 
parameters  of  the  "fitted"  models  must 
be  estimated  before  comparisons  can  be 
made.  In  essence,  this  restriction 
confounds  the  problem  of  estimator 
efficiency  with  the  problem  of  model 
specification. (Tapia and Thompson, 1978,
section 1.4, contains an excellent
description of this form of
confounding.) The same estimators of the
pdf f and cdf F are used to estimate all
R(x) curves and thus, the methods
described in this paper tend to
circumvent the problem of confounding
specification with parameter estimation.

If the estimator of f or F is poor at a
distance plus or minus K from the
estimated mode of f, one can simply
shorten the region over which R(x) is
estimated by changing K to some smaller
value, e.g., 2K/3. Due to the choice of
the  mean  integrated  squared  error 
metric,  MISE,  which  underlies  most 
kernel  and  series  density  estimation 
procedures,  estimators  can  be  said  to  be 
"center-weighted,"  in  much  the  same  way 
that  the  automatic  exposure  meters  built 
into  many  current  35mm  cameras  are 
designed  to  provide  the  most  accurate 
estimates  of  light  conditions  near  the 
image  center.  Thus,  one  can  be  fairly 
sure  that  as  the  constant  K  referred  to 
above  is  reduced,  one's  estimator 
accuracy  will  improve.  Of  course  there 
are  limits  to  the  effectiveness  of  this 
procedure  since,  as  is  true  in 
photography,  peripheral  information  may 
be  important  even  if  it  is  slightly  out 
of  focus. 

In  the  context  of  alternatives  to  the 
R(x)  approach,  one  could  choose  to 
modify  the  Chi-squared  goodness  of  fit 
test,  and  only  compare  observed  to 
fitted  model  frequencies  within  a 
subinterval  of  the  model's  support. 
However,  when  one  chooses  to  do  this, 
one  is  faced  with  the  dilemma  of  whether 
or  not  one  should  base  the  estimation  of 
the  parameters  of  the  fitted  model  upon 
a  data  set  which  is  censored  to  solely 
include  values  which  lie  within  the 
restricted  subinterval.  In  the 
univariate  case,  one  could  attempt  to 
use  BLU  procedures  to  obtain  estimates, 

but  in  the  case  where  it  is  a  truncated 
conditional,  i.e.,  slice  or  single 
section  of  a  multicomponent  density  that 
is  fitted,  it  is  hard  to  see  how  one 
would  proceed  to  solve  this  problem, 
even if one could modify the K-S or
Chi-squared  goodness  of  fit  procedure  to 
deal  with  slices  or  sections.  Even  if 
one  could  find  a  variant  which  would 
apply  to  sections  or  slices,  in  the 
process  of  generalizing  these  two 
procedures,  one  would  assuredly  lose  the 
major  advantage  that  these  two  methods 
have  over  the  use  of  R(x)  when  the 
procedures  are  used  in  the  univariate 
case.  Specifically,  it  is  hard  to  see 
how  accurate  significance  levels  could 
be  obtained  for  these  generalizations. 

Besides their reliance on model-specific
estimation procedures, it is also
relevant to point out that both the
Chi-square goodness of fit and the K-S
method are associated with particular
choices of nonparametric estimators.
Through its dependence on class interval
frequencies, the Chi-square method is
closely connected with the conventional
histogram, while K-S methodology is of
course dependent upon the sample
cumulative step function. In Tarter and
Kronmal (1970) it is shown that in terms
of the underlying MISE metric, there
will always exist a Fourier series
density estimator which is superior to
its limiting case, which in the case of
unweighted series, happens to be the
sample  cumulative.  Thus,  one  has  good 
reason  to  suspect  that  a  model 
identification  method  which  is  based  on 
the  sample  cumulative  can  be 
substantially  improved  upon  by  the 
methods  described  above. 

In the case of a methodology which is
related to the conventional histogram,
by extending the same logic which
leads the STATGRAF and other Chi-square
goodness-of-fit programs to allow end
intervals to be pooled, one would think
that methods such as those presented by
Scott (1984), which greatly improve the
conventional histogram, could be very
successfully applied to find
improvements of the conventional
Chi-square goodness of fit test. Of course
one  could  also  use  these  methods  to 
estimate  the  f  and  F  arguments  of  the 
R(x)  functional  and  in  this  way 
circumvent  the  problem  of  estimating  the 
parameters  of  the  "fitted"  model  as  a 
preliminary  to  checking  on  the  validity 
of  model  family  choice. 

Generalizations: There are of course
many  possible  ways  of  generalizing  the 
methodology  presented  above.  For 
example,  PROPERTY  3  of  section  4  implies 
that  if  one  chooses  to  graph  the 
derivative  of  R(x) ,  R' (x) ,  one  can 
select  from  among  members  of  the  class 
of  "reciprocalmodels"  by  making  use  of 

the  identity  R' (x)  =  2o(x-a)  where  a  and 
a  are  defined  in  section  4. 

In  the  bivariate  case,  one  can  combine 
the  procedures  described  in  this  paper 
with  those  presented  in  Tarter  and 
Freeman  (1987)  and  obtain  a  graphical 
method  for  distinguishing  between  two 
possible  departures  from  the  assumptions 
which  underlie  linear  regression,  where 
regression  residuals  can  have  any  one  of 
the  logmodels,  loglogmodels  or  even 
reciprocalmodels as their distribution.
Unlike  the  logmodel  case,  we  have  not 
found  it  useful  to  pursue  the  above 
lines  of  inquiry  at  this  time.  Instead, 
the sensitivity of the R(x) estimates
obtained in our GKS-FORTRAN interactive
graphical system clearly suggests one
particular area of further fruitful
investigation. By forming composites of
various available nonparametric density
estimators, where each composite is
customized for a particular R(x)
application, we hope to both facilitate
certain areas of data exploration and
simultaneously learn more about the
advantages and limitations of the
estimators upon which the application of
the R(x) functional depends.

ROBUSTNESS OF WEIGHTED ESTIMATORS OF LOCATION: A SMALL-SAMPLE STUDY


Gregory  Campbell  and  Richard  T.  Shrager,  National  Institutes  of  Health 


The problem of estimation of location is
considered in the context of known as well as
misspecified weights. For the one-sample
problem, the studied estimators include weighted
analogs of the mean, the median, the median of
the Walsh averages, and Huber M-estimators, as
well as a computer-intensive procedure which
minimizes the weighted sum of absolute values of
pairwise sums and differences of deviations. For
estimators which employ a weighted median,
interpolation to improve performance is
considered. The estimators are evaluated by
computer simulation with respect to robustness to
weight misspecification as well as robustness to
outliers. These simulations, together with the
Kantorovich inequalities for bounds on the
asymptotic inefficiencies, provide insight
concerning the performance of these estimators
with misspecified weights.

1. INTRODUCTION

There are many situations in which there is a
natural weight associated with each of the
observations. (Here the weight refers to a fixed
weight attached to each observation as opposed to
the W-estimates of Tukey or to the iterative
reweighting schemes so useful in the calculation
of some estimators.) For example, if each
observation is the summary measure of location
for a group of data, one might use the inverse of
some measure of dispersion for the weight. Or in
the regression problem there are cases in which
the optimal weights of the pairwise slopes depend
on the spacing of the design (independent)
variables; see Jaeckel (1972) and Scholz (1978)
for details. In proportional sampling
situations, it is often the case that the weights
are merely the known probabilities associated
with the sampling plan.

In the early 1970's, the Princeton Study
(Andrews et al., 1972) looked at the question of
robustness for the one-sample location problem.
Of particular interest in that study was the
robustness of the estimators of location to the
presence of outliers, and to a lesser extent, to
the misspecification of the distribution. In
addition to those varieties of robustness, the
present study considers the behavior of
estimators of location for the situation in which
the observations have weights attached to them
but the weights are either known and correctly
specified or else misspecified. For example,
suppose that the observations are weighted but
the weights are ignored and an equal-weight
estimator is employed. How robust is such an
estimator to this misspecification? Also of
interest is the robustness to distribution and to
outliers. It is impossible to study every
estimator; in particular, most estimators which
are relatively inefficient but have a very high
breakdown constant are not included, in that for
contamination rates near one half it is not clear
what is contaminating what. As for robustness to
distribution, suppose, for example, that the
distribution is thought to be double exponential
but really turns out to be normal? Of primary
interest is the effect on these estimators in
small-sample cases when the weights are
misspecified.

2. THE ESTIMATORS

The estimators considered in this paper are as
follows:

1. weighted mean (WMEAN)

$$\mathrm{WMEAN} = \frac{\sum w_i^2 X_i}{\sum w_i^2}$$

where, here and throughout the paper, an
unlabelled sum runs from 1 to n. This estimator
is the weighted least squares estimator for the
squared weights. In addition, if the
observations are from normal distributions with
the same location but different standard
deviations $\sigma_i$, and if the weights are optimally
specified ($w_i = 1/\sigma_i$), this estimator is not only
unbiased but also the uniform minimum variance
unbiased estimator, the maximum likelihood
estimator, and asymptotically optimal for the
location.

2. weighted median (WMED)

$$\mathrm{WMED} = \mathrm{med}\{X_i \text{ with wt. } w_i\}$$

This estimator minimizes the weighted sum of
absolute deviations. It is median unbiased. If
the distribution is double exponential (Laplace),
this estimator is maximum likelihood and has
maximum efficiency.

3. Huber's weighted M-estimator (WH15)

WH15 is the implicit solution of $\mu$ in

$$\sum w_i\,\psi\!\left(\frac{w_i (X_i - \mu)}{\sigma}\right) = 0$$

where

$$\psi(z) = \begin{cases} 1.5 & \text{if } z > 1.5 \\ z & \text{if } |z| \le 1.5 \\ -1.5 & \text{if } z < -1.5 \end{cases}$$

and $\sigma$ is defined iteratively by

$$\sigma = \frac{\mathrm{med}\{ w_i\,|X_i - \mu| \}}{.6745}$$

where .6745 is the median of |Z|, for Z a
standard normal random variable, as suggested by
Hampel in Andrews et al. (1972). This is the
weighted analog of Huber's estimator with value c
of 1.5, hence the abbreviation WH15. For the
equal weights case, it is the maximum likelihood
estimate for the least informative distribution
and it minimizes the Fisher information (Huber,
1981).
4. weighted pseudo-median (WPMED)

$$\mathrm{WPMED} = \mathrm{med}\left\{ \frac{w_i X_i + w_j X_j}{w_i + w_j} \text{ with wt. } w_i + w_j \right\}$$

This estimator minimizes the absolute value of
the pairwise sums given by:

$$\sum |w_i(X_i - \mu) + w_j(X_j - \mu)|$$

This is the weighted median of the weighted Walsh
averages and reduces in the equal weight case to
the median of the Walsh averages, the
Hodges-Lehmann estimator associated with the
Wilcoxon signed rank statistic. As such it
relies on the symmetry of the distribution about
the unknown parameter. For equal weights, it is
an asymptotically efficient R-estimate for the
logistic distribution (Huber, 1981).

5. Weighted pairwise Least Absolute Value
Sum-Difference (WLAVSD)

WLAVSD is the implicit solution to the
minimization of the following function in $\mu$:

$$\sum |w_i(X_i-\mu) + w_j(X_j-\mu)| + \sum |w_i(X_i-\mu) - w_j(X_j-\mu)|$$

This also minimizes the weighted sum of the
ordered absolute residuals. It can be calculated
as a weighted least absolute value minimization
problem on the $n^2$ sums and differences, with
weights; hence, it is very computationally
intensive. In the case of equal weights, WLAVSD
reduces to the median of the Walsh averages, and
hence can be thought of, along with WPMED, as a
generalization of PMED.

These five are the weighted estimators
studied here. The estimators corresponding to
the equal weights are denoted with the prefix W
omitted: MEAN, MED, H15, and PMED. Admittedly
there are other appropriate estimators which one
might have also included. Of particular interest
in this study is the examination of the
one-sample behavior of estimators which will be
generalizable to the regression setting.


3. SIMULATIONS

Simulations were performed to evaluate the
behavior of these various estimators under the
correct weights and also under incorrect ones.
Denoting the minimum of the n weights by 1
and the maximum by R (≥ 1), the following weight
schemes were used, where $w_i = 1/\sigma_i$ for the known
standard deviations $\sigma_i$:

1. Equal -- all weights equal to 1 (R = 1).
2. Extreme -- half the weights at 1, half at R.
3. Uniform -- equally spaced, 1 to R.

The following distributions were used to
generate pseudo-random numbers using the IMSL
statistical routines (IMSL, 1982):

1. Normal (NOR)

2. Contaminated normal (CNOR) -- 20%
contamination of a normal distribution with
another normal with the same mean and three times
the standard deviation.

3. Double exponential (DEXP)

4. Logistic (LGST)

5. Uniform (UNIF)

All were selected to have theoretical variances
of 1. One thousand replications of each sample
of size n were performed. In order to evaluate
estimators, for each sample of size n, the
estimator is calculated. For the thousand
estimates of the known quantity, the variance is
calculated and used to compare estimators.

4. CONSIDERATIONS FOR WEIGHTED MEDIANS

Because several of the estimators involve
weighted medians, important choices confront one
when considering the small-sample behavior of
such estimators. In large samples these
considerations are of little importance, in
that the estimators are asymptotically
equivalent. But the choice is quite important in
small samples. The issue is whether or not one
should interpolate to obtain the weighted median,
and, if so, which one of a number of
interpolations to employ. With arbitrary weights
or a different number of points, differences
become apparent. Define the weighted empirical
(sample) distribution function as follows:

$$F_n(x) = \begin{cases} 0 & \text{for } x < x_1 \\ s_j & \text{for } x_j \le x < x_{j+1}, \quad j = 1, \ldots, n-1 \\ 1 & \text{for } x \ge x_n \end{cases}$$

where $s_j = W_j / W_n$ and $W_j = \sum_{i=1}^{j} w_i$. This function
is a discrete, step function. It might be
expected that a continuous version might
outperform it. Consider the following possible
interpolation schemes, based on the sample
weighted distribution function $F_n(x)$, where
without loss of generality, the data are
ordered $x_1 \le x_2 \le \ldots \le x_n$ and $w_i$ is the weight
corresponding to $x_i$.

A. simple - always use $F_n^{-1}(.5)$. If $F_n(x) = .5$
for $x_i \le x < x_{i+1}$, then by convention the simple
median is $(x_i + x_{i+1})/2$.

B. mid-data - using the sample
distribution function and plotting only the
midpoints in the horizontal line segments,
linearize; i.e., plot the points in the table

    x:  (x_1+x_2)/2    (x_2+x_3)/2    ...    (x_{n-1}+x_n)/2
    s:  s_1            s_2            ...    s_{n-1}

and connect by line segments. The mid-data
median is the unique inverse at .5.

C. mid-weight - using the sample distribution
function and plotting only the midpoints of the
vertical line segments, linearize. In other
words, use the table below and connect by line
segments.

    x:  x_1      x_2            x_3            ...    x_n
    s:  s_1/2    (s_1+s_2)/2    (s_2+s_3)/2    ...    (s_{n-1}+s_n)/2

The mid-weight median is the inverse at .5.

D. mixed - using the sample distribution
function and plotting the midpoints of the
horizontal and vertical line segments, linearize;
i.e., merge the above two tables

    x:  x_1      (x_1+x_2)/2    x_2            (x_2+x_3)/2    ...    x_n
    s:  s_1/2    s_1            (s_1+s_2)/2    s_2            ...    (s_{n-1}+s_n)/2

and connect by line segments and take the inverse
at .5 to obtain the mixed median.
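The mixed scheme can be sketched in a few lines of code. This is an illustrative reading of the merged table, not the authors' implementation: the function name `mixed_weighted_median` is hypothetical, the weights are assumed to be strictly positive, and the knots are the midpoints of the vertical jumps at each data point together with the midpoints of the horizontal steps between consecutive data points.

```python
def mixed_weighted_median(xs, ws):
    """Inverse at 0.5 of the piecewise-linear curve through the 'mixed' knots."""
    pairs = sorted(zip(xs, ws))
    x = [p[0] for p in pairs]
    w = [p[1] for p in pairs]
    total = sum(w)

    # Cumulative weight fractions s_j = W_j / W_n.
    s, running = [], 0.0
    for wi in w:
        running += wi
        s.append(running / total)

    # Build the knots: (x_j, midpoint of the jump at x_j), then
    # (midpoint of [x_j, x_{j+1}], s_j) for the flat step that follows.
    px, ps = [], []
    prev = 0.0
    for j in range(len(x)):
        px.append(x[j])
        ps.append((prev + s[j]) / 2)
        if j + 1 < len(x):
            px.append((x[j] + x[j + 1]) / 2)
            ps.append(s[j])
        prev = s[j]

    # Linear interpolation for the abscissa where the curve reaches 0.5.
    for k in range(len(ps) - 1):
        if ps[k] <= 0.5 <= ps[k + 1]:
            if ps[k + 1] == ps[k]:
                return (px[k] + px[k + 1]) / 2
            t = (0.5 - ps[k]) / (ps[k + 1] - ps[k])
            return px[k] + t * (px[k + 1] - px[k])
    return px[-1]  # reached only for a single observation

equal_odd = mixed_weighted_median([1, 2, 3], [1, 1, 1])
equal_even = mixed_weighted_median([1, 2, 3, 4], [1, 1, 1, 1])
unequal = mixed_weighted_median([1, 2, 3], [1, 1, 2])
```

With equal weights the sketch returns the middle observation for odd n and the average of the two middle observations for even n, so the usual median convention is preserved; with weights (1, 1, 2) on the points 1, 2, 3 the curve passes through (2.5, 0.5).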

Note that in the equal-weight case the usual
median convention is obtained only in the simple
and mixed cases and not in general in either the
mid-data case for odd n or the mid-weight case for
tied x's. Differences between these two flawed
interpolations and the mixed interpolation median
are illustrated in Figures 1 and 2. Table 1
reports the variances associated with the
simple versus mixed interpolated weighted medians
using 1000 simulations for n = 10 and R = 4 for
the equally spaced weights. Note that the
mixed median variance is 10% smaller than the
simple for NOR, CNOR, and DEXP and even more for
UNIF. Such superiority of the mixed median
cannot be ignored in the calculation of weighted
medians and hence in all simulations that follow,
the mixed interpolated median is used.


TABLE 1:  VARIANCES (x10000n) FOR EQUALLY SPACED
WEIGHTS (R=4) FOR SIMPLE AND MIXED
INTERPOLATED MEDIANS (n=10)

                  NOR    CNOR   DEXP   LGST   UNIF
SIMPLE MEDIAN     2142   1272   1153   1699   3427
MIXED MEDIAN      19?5   110?   1041   15?0   2979

5.  COMPARISON OF THE ESTIMATORS

For each sample of size n, all 9 estimators
MEAN  H15  PMED  MED  WMEAN  WH15  WPMED  WLAVSD  WMED
were calculated.  From the 1000 replications, the
mean and variance (based on the known location)
were calculated and reported in the tables that
follow.  Note that simulations were carried out
for n=10 and n=20.  In addition, the asymptotic
variances where known are also reported.
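As an illustration of how such a table entry is produced, the sketch below estimates n times the variance about the known location, scaled by 1000, for the unweighted MEAN and MED under the standard normal. The seed, the helper name `scaled_variance`, and the use of Python's `random` module are assumptions made for this example; the original study used its own generators and is not reproduced here.

```python
import random
import statistics


def scaled_variance(estimator, sampler, n, reps=1000, location=0.0,
                    scale=1000, seed=1988):
    """Return scale * n * (mean squared deviation from the known location)."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    total = 0.0
    for _ in range(reps):
        sample = [sampler(rng) for _ in range(n)]
        total += (estimator(sample) - location) ** 2
    return scale * n * total / reps


def standard_normal(rng):
    return rng.gauss(0.0, 1.0)


mean_var = scaled_variance(statistics.mean, standard_normal, n=10)
median_var = scaled_variance(statistics.median, standard_normal, n=10)
print(round(mean_var), round(median_var))
```

The value for the mean should fall near the asymptotic 1000 and the value for the median well above it, in line with the NOR column of Table 2.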

Consider Table 2 which reports the variances of
the estimators in the equal weights case.  Note
for this unweighted case that the MEAN performs
quite well for the distributions NOR and UNIF.
Furthermore, the variances for MEAN do not vary
much for the other distributions.  For the
heavy-tailed distributions CNOR, DEXP, and LGST,
the estimators MED and PMED appear to perform
quite well, with MED doing slightly better for
DEXP and PMED for LGST and CNOR.  Comparison of
the behavior of the estimators for n=10 and n=20
and the asymptotic variance is instructive in
evaluating the differences between the
small-sample and the asymptotic performance of
these estimators.  In Table 2, the approximate
standard deviations associated with the mean for
the distributions NOR, CNOR, DEXP, LGST, and UNIF
depend on the kurtosis of the distribution; they
are (x1000n) 45, 66, 70, 57, and 28 respectively.
Note that these standard deviation values do not
depend on n, and, in general, cannot be expected
to converge to the reported asymptotic variance
as n, the size of the sample, tends to infinity.
It is noteworthy that small-sample variances for
MED are quite different from the asymptotic
variance, and to a lesser extent for PMED.  For
the medians and, to a lesser extent, the
pseudo-medians, it is conjectured that as n
increases the sample variances associated with
the replications tend to decrease to values that
vary about the asymptotic variances reported in
the table.  Also included in this table are the
results of the simulations reported by Andrews et
al (1972).  A quick glance confirms that similar
results are obtained here as in their study.  For
a related study of the small-sample behavior of
the associated tests for the estimators MEAN,
MED, and PMED for the distributions NOR, DEXP,
LGST, UNIF, and the Cauchy, there is the report
of the empirical powers in Randles and Wolfe
(1979).

Consider Table 3 in which, for extreme weights
and R=4, the variances are presented for the
unweighted and weighted estimators.  Among the
weighted estimators (and hence all estimators
since each weighted estimator outperforms its
unweighted analog), note that WMEAN is best for
NOR and UNIF, that WMED is best for DEXP, and
that WLAVSD and, to a lesser extent, WPMED
perform well for LGST, as one might expect from
Table 2.  If attention is limited to the
estimators which ignore the weights (the
estimators without the prefix W), then MEAN is
very poorly behaved, even for the NOR
distribution.  Also, MED is uniformly best of the
four unweighted estimators for each of the
distributions.  The asymptotic variance rows are
obtained using the asymptotic distribution for
the weighted estimator based on a fixed number,
n, of weights and letting the number of
observations, k, at each weight tend to infinity
and multiplying the asymptotic variance by
the corresponding scale factor.  Note that for this
extreme weight case this asymptotic variance
result does not depend on n.  As in Table 2, MED
and now WMED have
small-sample variances which are much larger than
the asymptotic ones for DEXP.

For Table 4, for equally-spaced weights and
R=4, note that the variances are generally
smaller than observed in the extreme weight case
for the same R in Table 3, as one might have
expected.  Among the weighted estimators, WMEAN
is best for NOR and UNIF and poorly behaved for
DEXP and CNOR.  WMED performs fairly well for
DEXP for n=10 and n=20, although not as well as
the asymptotic result might have one believe.
Note that in this case of equally spaced weights
the asymptotic variances are different only for
WMED for n=10 versus n=20.  WLAVSD performs well
for LGST, CNOR and DEXP.  It is no surprise that
WMEAN is not robust to the heavy-tailed
distributions whereas WMED, WLAVSD and WPMED
are.  Among the estimators which ignore the
weights, MED does well for CNOR, DEXP, LGST and
PMED performs well for NOR and UNIF.  As in Table
3, the MEAN is dismally behaved for the
misspecified weights.  It is interesting for UNIF
that WMED has larger variance than PMED and H15.

6.  KANTOROVICH'S  INEQUALITY 

Cressie (1980) considered weighted
M-estimation and its large-sample behavior
relative to weight misspecification.  As
mentioned there, for known weights w_j,
Kantorovich (1948) proved that the measure, r, of
inefficiency, given by the ratio of the variances
of the unweighted MEAN to the optimally weighted
WMEAN, is bounded by

    $$ r(\mathrm{MEAN}, \mathrm{WMEAN}) \le 1 + \frac{(R^2 - 1)^2}{4R^2} \qquad (1) $$

where R = M/m, M = max_j(w_j σ_j) and m = min_j(w_j σ_j).

For  the  median,  Tukey  has  shown,  as  mentioned  by 
Cressie,  that 

    $$ r(\mathrm{MED}, \mathrm{WMED}) \le 1 + \frac{(R - 1)^2}{4R} \qquad (2) $$


where R is as above.  (Cressie also reports a
Kantorovich inequality for M-estimates, but it is
inappropriate here in that the weighted
M-estimates considered by him are either in the
homogeneous variance case or have a psi function
which is homogeneous with regard to internal
weights and unfortunately the influence function
for the H15 is not homogeneous.)

For R = 4, the Kantorovich bounds in equations
(1) and (2) are 4.516 and 1.5625.  These can be
compared with the small-sample inefficiencies as
observed by the ratios of the estimated variances
of Tables 3 and 4.  These ratios are reported in
Table 5 (x1000).  As is no surprise, the
small-sample inefficiency ratios are larger for
the extreme weights than for the equally-spaced
ones; in fact it is exactly the extreme weights
that give the bound for the inequality.  It is
interesting to note that while MED has the best
observed efficiencies across the five
distributions, PMED is somewhat competitive,
especially in the equally spaced weighting
situation.  The choice of which estimator to use
clearly depends upon the quantification of the
likely weight misspecification as well as the
possibility of outliers or deviation from the
believed distribution.  For small R, weights play
a less important role, and for large R, the
possibility of misspecification becomes crucial
in the selection of a one-sample estimator.  As
advocated by Tukey and as mentioned by Cressie
(1980), for R<2 the mean has some advantage and
only for larger R does the median assert itself.
This is certainly seen in this study for R=4 with
these two weighting schemes.  Note for equally
spaced weights and R=4 that PMED competes quite
well with MED when one takes into account the
poor efficiency behavior of the MED to PMED for
NOR, CNOR, LGST and UNIF (recall the asymptotic
relative efficiencies of MED to PMED for these
four distributions are 2/3, .79, 3/4 and 1/3,
respectively).
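The two bounds quoted for R = 4 can be checked with a short calculation. The function names below are hypothetical; the expressions are the right-hand sides of (1) and (2), which reproduce the quoted values 4.516 and 1.5625.

```python
def kantorovich_bound_mean(R):
    """Right-hand side of (1): 1 + (R^2 - 1)^2 / (4 R^2)."""
    return 1 + (R ** 2 - 1) ** 2 / (4 * R ** 2)


def kantorovich_bound_median(R):
    """Right-hand side of (2): 1 + (R - 1)^2 / (4 R)."""
    return 1 + (R - 1) ** 2 / (4 * R)


print(kantorovich_bound_mean(4))    # 4.515625
print(kantorovich_bound_median(4))  # 1.5625
```

Both bounds equal 1 at R = 1, where the weights are all equal and ignoring them costs nothing.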

7.  CONCLUSIONS 

Weights, even if hidden in an estimation
problem, can play an important role in the
selection of the estimator.  Failure to recognize
the existence of unequal weights can be
disastrous even if the distribution is correctly
identified.  In particular, the mean is terribly
non-robust with respect to weight
misspecification, even with data that are normal.
The median, on the other hand, while robust with
regard to weights and with regard to outliers,
does have some poor efficiency properties if the
distribution is not heavy-tailed.  A compromise
might be to use the Hodges-Lehmann pseudo-median
estimator (whose weighted analogs are WLAVSD and
WPMED), which has reasonable robustness to
outliers and distribution and which is more
robust to weight misspecification than either the
mean or the Huber M-estimate.

The small-sample behavior differs from the
known asymptotic results for several of the
studied situations.  In particular, one
surprising result is the margin by which the
small-sample estimated variance differs in the
case of the median from the asymptotic
prediction.  This substantial discrepancy is
expected to widen if one were to examine
related estimators in the simple linear
regression problem.  It is also interesting that
there is a difference in the other direction with
regard to the normal and uniform distributions;
namely, that the observed small-sample variances
are smaller than expected from the asymptotic
prediction.  These results may not seem so
surprising if for the median one recalls that the
asymptotic approximation is influenced by the
smoothness of the density in the neighborhood of
the theoretical median.

In conclusion, it appears that there is no optimal
resolution concerning the selection of an
estimator that is robust with respect to weights
as well as to outliers.  The choice of an
estimator depends upon the weights, their spacing
and range R (even if not identified), and the
behavior of the distribution in the tails.

The authors gratefully acknowledge the
technical and editorial assistance of M. Hodges
of the Division of Computer Research and
Technology.




REFERENCES

Andrews, D.F., Bickel, P.J., Hampel, F.R., Huber,
P.J., Rogers, W.H., and Tukey, J.W. (1972).
Robust Estimates of Location: Survey and
Advances.  Princeton University Press:
Princeton, NJ.

Cressie, N. (1980).  Weighted M-estimation in the
presence of unequal scale.  Statistica
Neerlandica 34, 19-32.

Huber, P. J. (1981).  Robust Statistics.  John
Wiley and Sons: New York, New York.

IMSL (1982).  IMSL Library.  IMSL: Houston,
Texas.

Jaeckel, L. (1972).  Estimating regression
coefficients by minimizing the dispersion of
the residuals.  Ann. Math. Statist. 43, 1449-1458.

Kantorovich, L. (1948).  Functional analysis and
applied mathematics.  Uspekhi Mat. Nauk 3,
89-185.

Randles, R. H. and Wolfe, D. A. (1979).
Introduction to the Theory of Nonparametric
Statistics.  John Wiley and Sons: New York,
New York.

Scholz, F.-W. (1978).  Weighted median regression
estimates.  Ann. Statist. 6, 603-609.


TABLE 4:  VARIANCES (x10000n) -- EQUALLY
SPACED WTS (R=4)

                          NOR   CNOR   DEXP   LGST   UNIF
MEAN     (n=10)          2863   2679   2824   3103   2991
         (n=20)          2711   2487   2765   2742   2684
         (asym)          2836   2836   2836   2836   2836
H15      (n=10)          2379   1582   1648   2143   2606
         (n=20)          2135   1373   1563   1976   2616
PMED     (n=10)          2299   1461   1517   2070   2880
         (n=20)          2101   1292   1389   1903   2702
MED      (n=10)          2392   1283   1238   1964   3599
         (n=20)          2421   1281   1052   1996   4167
         (asym)          2513   1287    800   1945   4800
WMEAN    (n=10)          1394   1421   1289   1507   1396
         (n=20)          1351   1432   1488   1378   1360
         (asym)          1395   1395   1395   1395   1395
WH15     (n=10)          1467   1049   1038   1427   1555
         (n=20)          1395    979   1176   1330   1479
WPMED    (n=10)          1556   1047   1044   1463   1784
         (n=20)          1492    986   1107   1369   1664
WLAVSD   (n=10)          1510   1039    996   1423   1670
         (n=20)          1452    973   1084   1313   1586
WMED     (n=10)          1978   1089   1001   1661   2943
         (asym, n=10)    2192   1122    698   1696   4166
         (n=20)          2044   1110   1002   1625   3516
         (asym, n=20)    2219   1136    706   1718   4238

TABLE 2:  VARIANCES (x1000n) - EQUAL WTS

                  NOR   CNOR   DEXP   LGST   UNIF
MEAN   n=10      1011    978    958   1095   1038
       n=20      1013   1029   1112   1009    993
       PRINC*    1000   1038   1050
       ASYMP     1000   1000   1000   1000   1000
H15    n=10      1091    735    765   1014   1177
       n=20      1071    678    881    946   1102
       PRINC*    1031    690    820
PMED   n=10      1121    706    723   1039   1300
       n=20      1077    662      ?    938   1205
       PRINC*    1063    673    745
       ASYMP     1047    635    667    912   1000
MED    n=10      1476    723    723   1183   2329
       n=20      1515    735    743   1144   2612
       PRINC*    1366    708    665
       ASYMP     1571    804    500   1216   3000

*Values as reported in the Princeton study
by Andrews et al (1972): n=10 for NOR and
CNOR, n=20 for DEXP.


TABLE 3:  VARIANCES (x10000n) -- EXTREME WTS (R=4)

                    NOR   CNOR   DEXP   LGST   UNIF
MEAN     (n=10)    5471   5084   5467   5892   5765
         (n=20)    5502   5092   5259   5491   5174
         (asym)    5313   5313   5313   5313   5313
H15      (n=10)    4145   2568   2919   3709   5438
         (n=20)    3688   2103   2288   3146   4955
PMED     (n=10)    3581   2149   2503   3334   5041
         (n=20)    3159   1737   1950   2742   4101
MED      (n=10)    2452   1486   1447   2141   3654
         (n=20)    2476   1295   1294   1955   3760
         (asym)    2513   1287    800   1945   4800
WMEAN    (n=10)    1155   1190   1067   1266   1155
         (n=20)    1235   1157   1228   1194   1228
         (asym)    1177   1177   1177   1177   1177
WH15     (n=10)    1209    690    875   1224   1274
         (n=20)    1296    803    975   1132   1327
WPMED    (n=10)    1323    913    895   1280   1469
         (n=20)    1367    626    965   1215   1523
WLAVSD   (n=10)    1225    664    849   1218   1363
         (n=20)    1294    784    933   1133   1415
WMED     (n=10)    1589    898    861   1483   2361
         (n=20)    181?    899    921   1359   2797
         (asym)    1848    946    588   1430   3530

TABLE 5:  OBSERVED MEASURE OF INEFFICIENCY
(x1000) (n=10)

          NOR   CNOR   DEXP   LGST   UNIF

EXTREME WTS (R=4)
MEAN     4739   4271   5124   4652   4993
H15      3429   2687   3335   3031   4269
PMED     2708   2353   2797   2605   3386
MED      1544      ?   1681   1444   1547

EQUALLY SPACED WTS (R=4)
MEAN     2053   1685   2191   2059   2143
H15      1622   1509   1588   1502   1805
PMED     1477   1396   1453   1415   1614
MED      1209   1178   1238   1183   1223
APPROXIMATIONS OF THE WILCOXON RANK SUM TEST IN SMALL SAMPLES WITH LOTS OF TIES


Arthur R. Silverberg, Food and Drug Administration

Abstract

The Wilcoxon-Mann-Whitney rank sum test for
two independent samples is frequently used
with data having ties.  Although there are
computer programs to calculate the exact
randomization test, even for small samples,
computer packages use approximations based
upon the normal, Student's t distribution, or
the distribution without ties.  For each of
the small sample sizes considered in this
paper, all distributions of obtaining ties
were considered, as well as all permutations
of the ordering of the ties.  The exact
distribution with ties was compared to the
tabulated value without ties, normal
approximations with and without continuity
corrections, and Edgeworth expansions with and
without continuity corrections.  The purpose
of looking at all these distributions was to
quantify the accuracy of the common
approximations, rather than to develop any new
approximations.


1.  Introduction

Recommendations for approximations to the
Wilcoxon-Mann-Whitney rank sum statistic in
the case of ties vary.  Conover [2] suggests
using the normal approximation with no
continuity correction when there are ties.
Klotz [6] found that the Edgeworth
approximation gave only a small improvement
over the normal approximation; both
approximations used a continuity correction.
Hollander and Wolfe [4] suggest the use of the
exact tables for no ties when ties are
present and the samples are small.  When the
samples are large, the normal approximation
with no continuity correction is suggested.
When the largest proportion of sample values
in a tied category is not close to 1,
Lehmann [7] suggests the use of the continuity
corrected normal approximation.  Emerson and
Moses [3] recommend using exact calculations
unless both sample sizes are at least 10.
They state that the normal approximation is
unreliable when over half the observations
fall into one category.  Although Emerson and
Moses state that the use of the continuity
correction makes little difference, they
recommend against the use of the continuity
correction because of the unequal spacing of
the statistic.

Klotz [6] provides an algorithm and flow
chart for calculating the exact probability of
the Wilcoxon-Mann-Whitney statistic.  A
network algorithm for the exact Wilcoxon-Mann-Whitney
test with ties is found in Mehta,
Patel and Tsiatis [9].  That paper contains
some typographic errors.  A fuller explanation
of the network algorithm, in the case of 2xk
contingency tables, is found in Mehta and
Patel [8].

Major computer packages use a variety of
approximations.  BMDP [1] uses a normal
approximation.  It is not clear from the
documentation if a continuity correction is
used.  SAS [10] provides the normal
approximation with and without continuity
correction, and a t-distribution approximation
based on n_1+n_2-1 degrees of freedom.
SPSS-X [12] gives a normal approximation with
no continuity correction and, for n_1+n_2<30, the
tabular p-value from a table assuming no ties.
IMSL [5] uses three normal approximations
without the continuity correction.  Ties are
broken to give both the highest and lowest
possible statistics.  The approximation is then
applied to these two statistics as well as the
original data with ties.

This author has written a computer program
to calculate the exact value of the
Wilcoxon-Mann-Whitney statistic with ties for
IBM-PC compatible computers based upon the
Mehta, Patel and Tsiatis algorithm.  The
program was written in compiled Turbo Pascal
Version 3.0 and is available to interested
parties who mail a formatted diskette (either
3.5 or 5.25 inch) in a self-addressed mailer
to the author.

2.  Exact Distribution Under H_0

Let us denote the smaller sample size by n_1
and the larger sample size by n_2, n_1+n_2=N.  Let

    $$ W = \sum_{i=1}^{n_1} R_i $$

where R_i is the rank of observation i
from the smaller sample, and let t_j, j=1,...,c, be
the number of observations from both samples
that are tied in category j, R_j<R_{j+1}.  It is
well known that

    $$ E(W) = \frac{n_1 (N+1)}{2} $$

and

    $$ \sigma^2 = \frac{n_1 n_2}{12} \left[ N + 1 - \sum_{j=1}^{c} \frac{t_j (t_j^2 - 1)}{N(N-1)} \right]. $$

It can be shown that

    $$ \kappa_3 = \frac{n_1 n_2 (n_1 - n_2)}{4N(N-1)(N-2)} \sum_{j=1}^{c} t_j (t_j^2 - 1) \left[ R_j - \frac{N+1}{2} \right]. $$

This formula for \kappa_3 seems new.  It is easy to
see that \kappa_3=0 when n_1=n_2, or if t_j=t_{c+1-j}, as
pointed out by Klotz [6].

The formula for \kappa_4 as given at the bottom
of the next page seems to be new.
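The closed forms for the mean, variance and third cumulant of W can be checked against brute-force enumeration of all equally likely assignments of midranks to the smaller sample. The sketch below does this with exact rational arithmetic. The function names are hypothetical, and the third-cumulant expression is the form read here from the text, sum over j of t_j (t_j^2 - 1)(R_j - (N+1)/2) with the stated leading factor, so treat it as an assumption to be verified rather than a quotation.

```python
from fractions import Fraction
from itertools import combinations


def midranks(tie_sizes):
    """Midrank of each tie category, listed in increasing order of the data."""
    ranks, start = [], 0
    for t in tie_sizes:
        ranks.append(start + Fraction(t + 1, 2))
        start += t
    return ranks


def formula_moments(n1, n2, tie_sizes):
    """Mean, variance and third cumulant of W from the closed forms."""
    N = n1 + n2
    R = midranks(tie_sizes)
    mean = Fraction(n1 * (N + 1), 2)
    tie_sum = sum(t * (t * t - 1) for t in tie_sizes)
    variance = Fraction(n1 * n2, 12) * (N + 1 - Fraction(tie_sum, N * (N - 1)))
    skew_sum = sum(t * (t * t - 1) * (r - Fraction(N + 1, 2))
                   for t, r in zip(tie_sizes, R))
    kappa3 = Fraction(n1 * n2 * (n1 - n2), 4 * N * (N - 1) * (N - 2)) * skew_sum
    return mean, variance, kappa3


def enumerated_moments(n1, tie_sizes):
    """Same three quantities by enumerating every choice of n1 midranks."""
    scores = [r for t, r in zip(tie_sizes, midranks(tie_sizes)) for _ in range(t)]
    sums = [sum(c) for c in combinations(scores, n1)]
    count = len(sums)
    mean = sum(sums) / count
    variance = sum((s - mean) ** 2 for s in sums) / count
    kappa3 = sum((s - mean) ** 3 for s in sums) / count
    return mean, variance, kappa3


print(formula_moments(2, 3, (1, 1, 3)))
print(enumerated_moments(2, (1, 1, 3)))
```

For n_1=2, n_2=3 and tie sizes (1, 1, 3) both routes give mean 6, variance 12/5 and third cumulant -3/5.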

For each total sample size N considered,
all distributions of ties were considered.
The number of such distributions, p(N), is the
number of unrestricted partitions of N.  For
example, for N=5, p(5)=7 since

1)  1+1+1+1+1=5
2)  1+1+1+2=5
3)  1+1+3=5
4)  1+2+2=5
5)  1+4=5
6)  2+3=5
7)  5=5


For small N,

    N       4    5    6    7    8    9   10   11
    p(N)    5    7   11   15   22   30   42   56

    N      12   13   14   15   16   17   18   19
    p(N)   77  101  135  176  231  297  385  490
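The partition counts in the table can be reproduced with the standard dynamic-programming recurrence for unrestricted partitions. The function name is hypothetical; this is a sketch for checking the table, not the program used in the study.

```python
def partition_counts(max_n):
    """Return a list p with p[N] = number of unrestricted partitions of N."""
    p = [1] + [0] * max_n
    # Allow parts of size 1, then 2, ..., so each partition is counted once.
    for part in range(1, max_n + 1):
        for total in range(part, max_n + 1):
            p[total] += p[total - part]
    return p


p = partition_counts(19)
print(p[4:12])   # [5, 7, 11, 15, 22, 30, 42, 56]
print(p[12:20])  # [77, 101, 135, 176, 231, 297, 385, 490]
```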

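Ties may be in any order.  So for each
distribution of ties, we considered all
c!/((#t_j=1)! x ... x (#t_j=N)!) permutations of
the ties, where #t_j=k denotes the number of tie
categories of size k.  For example, for N=5 with
t_1=t_2=1 and t_3=3 there are
3!/(2! x 0! x 1! x 0! x 0!)=3 permutations, giving
the rank vectors

1)  1 2 4 4 4
2)  1 3 3 3 5
3)  2 2 2 4 5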
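The count of orderings and the resulting midrank vectors can be generated directly. The helper names below are hypothetical; the sketch enumerates the distinct orderings of a multiset of tie sizes and converts each ordering to midranks.

```python
from collections import Counter
from fractions import Fraction
from itertools import permutations
from math import factorial


def count_orderings(tie_sizes):
    """c! divided by the product of (number of categories of each size)!."""
    count = factorial(len(tie_sizes))
    for multiplicity in Counter(tie_sizes).values():
        count //= factorial(multiplicity)
    return count


def distinct_orderings(tie_sizes):
    """All distinct arrangements of the tie sizes along the ordered data."""
    return sorted(set(permutations(tie_sizes)))


def rank_vector(ordering):
    """Midranks of the N observations for one arrangement of tie sizes."""
    ranks, start = [], 0
    for t in ordering:
        ranks.extend([start + Fraction(t + 1, 2)] * t)
        start += t
    return ranks


for ordering in distinct_orderings((1, 1, 3)):
    print(ordering, [str(r) for r in rank_vector(ordering)])
```

For tie sizes (1, 1, 3) this lists three orderings, matching the three rank vectors in the example.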

The Wilcoxon-Mann-Whitney rank sum
statistic was computed for all possible
samples of size n_1≤n_2.  The exact
probabilities were compared to a number of
approximations explained below when either the
exact value or the respective approximation
was less than .1.
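For sample sizes as small as those studied here, the exact lower-tail probability can also be obtained by plain enumeration of every choice of n_1 midranks. The sketch below uses that brute-force approach, which is not the network algorithm of Mehta, Patel and Tsiatis used by the author; the function name is hypothetical.

```python
from fractions import Fraction
from itertools import combinations


def exact_lower_tail(w, n1, scores):
    """P(W <= w) under the null, where scores are the N midranks."""
    sums = [sum(c) for c in combinations(scores, n1)]
    hits = sum(1 for s in sums if s <= w)
    return Fraction(hits, len(sums))


# Midranks for tie sizes (1, 1, 3): 1 2 4 4 4, with n_1 = 2.
scores = [1, 2, 4, 4, 4]
print(exact_lower_tail(3, 2, scores))  # 1/10
print(exact_lower_tail(5, 2, scores))  # 2/5
```

Enumeration grows as N choose n_1, so it is only practical for small samples of the kind considered in this paper.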


3.  Approximations 

All comparisons were for one-sided tests
and p-values were for the alternative
hypothesis of the location of population one
smaller than that of population two.

Two approximations were based upon the
standard tables that assume no ties, and are
denoted as approximations T_1 and T_2.  For
approximation T_1, look up the statistic in the
table and find the p-value corresponding to
that integer, or the next larger value if the
statistic is non-integral because of ties (it
must always be either an integer or an integer
plus one-half).  T_2 is the same as T_1 except
that linear interpolation is performed for
non-integral values.
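The two table-based approximations can be sketched as follows. The no-ties table is generated here by enumeration rather than read from a printed table, "next larger value" is read as the next larger integer value of the statistic, and the function names are hypothetical; all three are assumptions of this example.

```python
import math
from fractions import Fraction
from itertools import combinations


def no_ties_cdf(k, n1, n2):
    """P(W <= k) under the null when there are no ties (lower tail)."""
    N = n1 + n2
    sums = [sum(c) for c in combinations(range(1, N + 1), n1)]
    return Fraction(sum(1 for s in sums if s <= k), len(sums))


def t1(w, n1, n2):
    """T_1: use the table entry for w, or for the next larger integer."""
    return no_ties_cdf(math.ceil(w), n1, n2)


def t2(w, n1, n2):
    """T_2: as T_1, but interpolate linearly for non-integral w."""
    lower, upper = math.floor(w), math.ceil(w)
    p_lower = no_ties_cdf(lower, n1, n2)
    if lower == upper:
        return p_lower
    p_upper = no_ties_cdf(upper, n1, n2)
    return p_lower + (Fraction(w) - lower) * (p_upper - p_lower)


print(t1(3.5, 2, 3), t2(3.5, 2, 3))  # 1/5 3/20
```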

Three approximations were based upon the
standard normal distribution, and are denoted
as approximations N_0, N_1 and N_gcd.
Approximation N_0 is the normal approximation
with no continuity correction.  Approximation
N_1 is the normal approximation with the
continuity correction based upon assuming the
lattice spacing Δ=1.  Smid [11] has shown that
the spacing of the exact distribution of the
Wilcoxon-Mann-Whitney statistic is a multiple
of .5 x gcd{t_1+t_2,...,t_{c-1}+t_c}, where gcd
denotes the greatest common divisor.  Therefore,
taking the suggestion of Klotz [6], we use
.25 x gcd{t_1+t_2,...,t_{c-1}+t_c} as a continuity
correction in N_gcd.  We have the following
relationships among these approximations:
N_0<N_1, N_0<N_gcd, and N_1<=>N_gcd depending on
whether 1<=>.5 x gcd{t_1+t_2,...,t_{c-1}+t_c}.
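A minimal sketch of the three normal approximations for the lower-tail p-value follows. It uses the tie-corrected mean and variance of W, adds the continuity correction to the statistic before standardizing, and takes the correction for N_1 to be half the assumed lattice spacing of 1. The function name and the dictionary keys are hypothetical, and at least two tie categories are assumed so that the gcd is defined.

```python
import math
from functools import reduce
from statistics import NormalDist


def normal_approximations(w, n1, n2, tie_sizes):
    """Lower-tail normal approximations N0, N1 and Ngcd for statistic w."""
    N = n1 + n2
    mean = n1 * (N + 1) / 2
    tie_sum = sum(t * (t * t - 1) for t in tie_sizes)
    sd = math.sqrt(n1 * n2 / 12 * (N + 1 - tie_sum / (N * (N - 1))))
    # gcd of the sums of adjacent tie-category sizes.
    adjacent = [a + b for a, b in zip(tie_sizes, tie_sizes[1:])]
    g = reduce(math.gcd, adjacent)
    corrections = {"N0": 0.0, "N1": 0.5, "Ngcd": 0.25 * g}
    cdf = NormalDist().cdf
    return {name: cdf((w - mean + cc) / sd) for name, cc in corrections.items()}


print(normal_approximations(3, 2, 3, (1, 1, 1, 1, 1)))
print(normal_approximations(3.5, 2, 3, (2, 3)))
```

With no ties the gcd is 2, so the N_gcd correction is .5 and coincides with N_1; with tie sizes (2, 3) the gcd is 5 and N_gcd exceeds N_1.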

Three approximations were based upon the
Edgeworth approximation, E_0, E_1 and E_gcd,
defined in an analogous way to N_0, N_1 and
N_gcd.  Therefore, E_0<E_1, E_0<E_gcd, and
E_1<=>E_gcd, depending on whether
1<=>.5 x gcd{t_1+t_2,...,t_{c-1}+t_c}.

The Edgeworth approximation is given by

    $$ E(x) = \Phi(x) - \frac{\kappa_3}{6\kappa_2^{3/2}} (x^2 - 1) Z(x)
              - \frac{\kappa_4}{24\kappa_2^{2}} (x^3 - 3x) Z(x)
              - \frac{\kappa_3^2}{72\kappa_2^{3}} (x^5 - 10x^3 + 15x) Z(x) $$

where \Phi(x) and Z(x) are the normal cumulative
distribution function and the probability
density respectively, and \kappa_i are the
cumulants.
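The Edgeworth expression translates directly into code once the cumulants are available. In the sketch below x is taken to be the standardized statistic and the cumulants are passed in by the caller; the function name is hypothetical, and no attempt is made here to compute the fourth cumulant from the paper's formula.

```python
from statistics import NormalDist


def edgeworth_cdf(x, k2, k3, k4):
    """Edgeworth approximation E(x) with cumulants k2, k3, k4."""
    std = NormalDist()
    z = std.pdf(x)  # standard normal density Z(x)
    return (std.cdf(x)
            - k3 / (6 * k2 ** 1.5) * (x ** 2 - 1) * z
            - k4 / (24 * k2 ** 2) * (x ** 3 - 3 * x) * z
            - k3 ** 2 / (72 * k2 ** 3) * (x ** 5 - 10 * x ** 3 + 15 * x) * z)


print(edgeworth_cdf(-1.5, 1.0, 0.0, 0.0))   # reduces to the normal CDF
print(edgeworth_cdf(-1.5, 1.0, 0.0, -0.5))  # with a kurtosis correction
```

When the third and fourth cumulants are zero the correction terms vanish and the function returns the plain normal approximation.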


4.  Accuracy  of  Approximations 

The tables give the maximum absolute error
|estimated-actual| and relative error
|(estimated-actual)/actual|.  The sign is
given as ± when the largest positive error
equals the largest negative error.  A value of 0
means that either all probabilities were
estimated correctly or there were no
estimated and no exact p-values less than 0.1.
Computations were performed on a VAX 8530
using a Basic language compiler at the Food
and Drug Administration.

Table 1A shows the largest absolute error
over all partitions when either the true or
estimated p-value is less than or equal to 0.1.
Even for the sample sizes that are not too
small, the largest errors are large.  The
approximations based upon the standard tables
are conservative in that the largest absolute
errors tend to occur when estimated>actual.


p  =n  n  (N+l)[n-n^(5N+7)-2N(N+l)1 
’  ^  240  ^  ^ 

ninjENCN+U-en^nj]  c 

- _  X  ^  t  (t/-l)(120R  (R,-(N+l))+3t/) 

240N(N-l)(N-2)(N-3)  j  =  l  J  J  9 


"l"2  ^  °  7 

-1  =  1  -*  X  [5(n  -l)(n,-l)  r  t.(t/-l) 

240N(N-l)(N-2)(N-3)  ^  j=l  J  J 

-42n  ^02- lOn  ^02 (N+ 1 ) (N^- 19N- 18 ) -N (N+ 1 ) (20N^+80N+ 13 ) 
The normal and Edgeworth approximations are
not conservative for absolute errors since the
largest absolute errors tend to occur when
estimated<actual.  The best approximation in
terms of overall absolute error seems to be
T_2 when n_1=n_2; T_1 when n_1<n_2, n_1=1,2; N_gcd when
n_1<n_2, n_1=3,4,5; and E_1 when n_1<n_2, n_1≥6.
Approximation N_0 gives the poorest performance
in terms of overall absolute error.

Table 1B shows the largest relative error
over all partitions when either the true or
estimated p-value is less than or equal to
0.1.  The largest relative error tends to
occur when the actual p-value is smallest, as
opposed to the absolute error that does not
have this tendency.  The larger sample sizes
tend to have larger relative overall errors.
All approximations studied are conservative in
that the largest relative errors tend to occur
when estimated>actual for n_1≥3.  The best
approximation in terms of overall relative
error seems to be T_1 for n_1=1,2, N_0 for
n_1=3,4, and E_0 for n_1≥5.  The approximations
that give the poorest performance in terms of
overall relative error seem to be N_0 for
n_1=1,2, and T_1 and T_2 for n_1≥3.

Table 2 shows the largest absolute and
relative errors for the partition when there
are no ties and the true or estimated p-value
is less than or equal to 0.1.  The table
look-ups of course have zero error, so they are
excluded.

The first four columns of Table 2 contain
the absolute errors over the no ties partition
when the true or estimated p-value is less
than or equal to 0.1.  The largest error of
the no ties partition is usually at least an
order of magnitude smaller than the errors
shown in Table 1A.  The two normal
approximations and approximation E_0 tend not
to be conservative in that the largest errors
occur when estimated<actual.  Approximation
E_1=E_gcd tends to be conservative (the largest
errors occur when estimated>actual) for the
larger sample sizes given.  Approximation
E_1=E_gcd is the best of the two normal and
two Edgeworth approximations in terms of
largest absolute error.  Approximation [?]
gives the poorest performance in terms of
largest absolute error over the no ties
partition.

The last four columns of Table 2 contain
the largest relative errors of the no ties
partition when the true or estimated p-value
is less than or equal to 0.1.  The largest
error of the no ties partition is frequently
nearly as large as the errors shown in
Table 1B, and in fact in some cases the largest
error in Table 1B was from the no ties
partition.  The two normal approximations tend
not to be conservative for larger samples in
that the largest relative errors occur when
estimated<actual.  Approximation E_0 tends to
be conservative in that the largest relative
errors occur when estimated>actual.
Approximation E_1=E_gcd does not seem to be
easily classifiable as being either
conservative or not conservative in terms of
largest relative error for the no ties
partition.  The Edgeworth approximations tend
to be better than the normal approximations in
terms of largest relative error, with E_1=E_gcd
usually better than E_0.  Approximation [?]
gives the poorest performance for n_1=1 in
terms of largest relative error of the no ties
partition, while approximation N_1=N_gcd gives
the poorest performance for n_1≥2.

5.  Recommendations 

If there are no ties, the Edgeworth
approximation with continuity correction,
E_1=E_gcd, is recommended when tables are not
available.  Since none of the approximations
is accurate when there are ties, even for
moderate size samples, the calculation of the
exact probability of the Wilcoxon-Mann-Whitney
statistic is recommended.
Table 1A:  Signed (Max |Estimated-Actual|)  All Partitions of Ties

 N     T_1     T_2     N_0     N_1    N_gcd    E_0     E_1    E_gcd

n_1=2
 8   -.036   -.036   -.208   -.176   -.129   -.191   -.116   -.125
 9   +.028   +.028   -.192   -.168   -.103   -.185   -.147   -.101
10   -.044   -.044   -.177   -.160   -.137   -.166   -.145   -.134
11   -.036   -.036   -.186   -.152   -.110   -.180   -.141   -.108
12   -.060   -.060   -.226   -.145   -.143   -.153   -.134   -.139
13   -.051   -.051   -.217   -.137   -.123   -.177   -.125   -.120
14   -.066   -.066   -.208   -.187   -.162   -.187   -.137   -.157
15   +.057   +.057   -.200   -.182   -.168   -.181   -.159   -.152
16   -.075   -.075   -.202   -.178   -.168   -.176   -.157   -.148
17   -.066   -.066   -.186   -.173   -.165   -.174   -.155   -.143
18   +.065   +.065   -.218   -.169   -.162   -.166   -.153   -.147

n_1=3
 8   +.107   +.107   -.277   -.099   -.107   -.170   -.099   -.108
 9   +.190   +.190   -.255   +.137   -.143   -.163   +.151   -.142
10   +.192   +.192   -.237   -.205   +.113   -.212   +.146   +.127
11   +.194   +.194   -.221   -.196   -.126   -.201   +.138   -.128
12   +.195   +.195   -.208   -.188   -.115   -.191   -.163   +.115
13   +.196   +.196   -.197   -.181   -.118   -.183   -.161   -.121
14   +.195   +.195   -.187   -.173   -.140   -.174   -.157   -.141
15   +.196   +.196   -.177   -.167   +.120   -.172   -.153   +.125
16   +.195   +.195   -.169   -.160   -.130   -.158   -.149   -.132
17   +.194   +.194   -.161   -.154   -.146    .159   -.144   -.147
18   +.194   +.194   -.218   -.148   -.124   -.174   -.137   -.126

n_1=4
 8   +.100   +.100   -.193   -.124   -.136   -.148   -.116   -.130
 9   +.095   +.095   -.190   -.103   -.101   -.185   -.095   -.094
10   +.138   +.107   -.181   -.096   -.102   -.170   -.097   -.103
11   +.121   +.121   -.271   -.122   -.129   -.165   -.122   -.130
12   +.194   +.194   -.255   -.147   -.155   -.168   +.139   -.155
13   +.225   +.200   -.241   -.216   +.110   -.215   +.137   +.122
14   +.204   +.204   -.229   -.208   -.126   -.206   +.132   -.130
15   +.229   +.208   -.218   -.201   -.145   -.198   -.175   -.147
16   +.210   +.210   -.208   -.194   -.108   -.191   -.171   -.113
17   +.230   +.212   -.200   -.187   -.125   -.185   -.168   -.129
18   +.213   +.213   -.192   -.181   -.140   -.178   -.164    .144

n_1=5
10   +.127   +.099   -.188   -.134   -.144   -.151   -.126   -.137
11   +.104   +.104   -.186   -.115   -.114   -.180   -.107   -.106
12   +.126   +.107   -.180   -.100   -.106   -.169   -.095   -.100
13   +.124   +.124   -.175   -.115   -.121   -.165   -.116   -.122
14   +.136   +.136   -.267   -.153   -.142   -.162   -.135   -.142
15   +.202   +.202   -.255   -.149   +.107   -.172   -.139   +.120
16   +.208   +.208   -.243   -.224   -.116   -.217   -.137   +.119
17   +.214   +.214   -.233   -.216   -.125   -.209   -.135   -.129
18   +.218   +.218   -.224   -.209   -.141   -.203   -.182   -.144

n_1=6
12   +.106   +.106   -.185   -.140   -.149   -.153   -.132   -.142
13   +.113   +.113   -.184   -.123   -.122   -.177   -.115   -.114
14   +.118   +.118   -.179   -.109   -.115   -.169   -.102   -.108
15   +.129   +.129   -.176   -.159   -.115   -.166   -.111   -.116
16   +.157   +.141   -.277   -.157   -.133   -.163   -.128   -.134
17   +.150   +.150   -.265   -.153   -.150   -.160   -.144   -.150
18   +.210   +.210   -.255   -.236   +.105   -.174   -.142   +.117

14 

+  .  132 

+  .  113 

-.  183 

n  =7 
-.  145^ 

-.152 

-.  171 

-.137 

-.  145 

15 

+  .  122 

+  .  122 

-.182 

-.129 

-.  129 

-.  175 

-.  121 

-.121 

16 

+  .  141 

+  .  127 

-.202 

-.  164 

-.121 

-.  168 

-.  109 

-.  114 

17 

+  .  130 

+  .  130 

-.  176 

-.  162 

-.  110 

-.  166 

-.  108 

-.  112 

18 

+  .  145 

+  .145 

-.  172 

-.159 

126 

-.  163 

-.  122 

-.  127 

16 

+  .  121 

+  .  121 

-.  181 

n  =8 
-.148 

-.  155 

-.  170 

-.  140 

147 

17 

+  .  129 

+  .129 

180 

-.  166 

-.  166 

-.  174 

126 

-.125 

18 

+  .  136 

+  .  136 

-.  198 

-.  165 

-.  127 

-.168 

-.  114 

-.  120 

18 

+  .096 

+  .092 

-.180 

n  =9 
-.167^ 

-.  157 

-.  169 

-.  142 

-,  149 



s 

-0.33 

-0.33 

9 

+0.33 

+0.33 

10 

-0.33 

-0.33 

11 

+0.33 

+0.33 

12 

+0.50 

+0.50 

13 

+0.50 

+0.50 

14 

+0.  50 

+0.50 

15 

+0.60 

+0.60 

16 

+0.60 

+0.60 

17 

+0.60 

+0.60 

18 

+0.67 

+0.67 

8 

+  1.20 

+  1.20 

9 

+2.29 

+2.29 

10 

+2.88 

+2.88 

11 

+3. 56 

+3.56 

12 

+4.30 

+4.30 

13 

+  5.09 

+  5.09 

14 

+  5.92 

+  5.92 

15 

+6.85 

+6.85 

16 

+7.79 

+7.79 

17 

+8.80 

+8.80 

18 

+9.88 

+9.88 

8 

+  1.40 

+  1.40 

9 

+2.00 

+2.00 

10 

+2.86 

+2.86 

11 

+3.75 

+3.75 

12 

+4.89 

+4.89 

13 

+6. 10 

+6. 10 

14 

+7.55 

+7.55 

15 

+9.08 

+9.08 

16 

+10.92 

+10.92 

17 

+12.86 

+12.86 

18 

+15.07 

+15.07 

10 

+2. 17 

+2. 17 

11 

+3.14 

+3. 14 

12 

+4.25 

+4.25 

13 

+  5.67 

+5.67 

14 

+  7.30 

+7.30 

15 

+9.27 

+9.27 

16 

+  11  50 

+11. 50 

17 

+14. 15 

+14. 15 

18 

+17. 14 

+17. 14 

12 

+3.29 

+3.29 

13 

+4.50 

+4.50 

14 

+6.  11 

+6.  11 

15 

+8.00 

+8,00 

16 

+10.36 

+10.36 

17 

+13.08 

+13.08 

18 

+  16.46 

+16,46 

14 

+4.62 

+4.62 

15 

+6.33 

+6.33 

16 

+8.40 

+8.40 

17 

+11.00 

+11.00 

1« 

+14.08 

+14.08 

16 

+6.44 

+6.44 

17 

+8.60 

+8.60 

18 

+  11.36 

+11.36 

18 

+8.70 

+8,70 

-0.89 

n  =2 
-0.79 

-0.83 

-0.92 

-0.84 

-0.84 

-0.94 

-0.89 

-0.91 

-0.96 

-0.92 

-0.92 

-0.97 

-0.95 

-0.96 

-0.98 

-0.96 

-0.96 

-0.99 

-0.97 

+  1.25 

-0.99 

-0.98 

+  1.26 

-0.99 

-0.99 

+  1.67 

-1.00 

+  1.08 

+  1.61 

-1.00 

+  1.25 

+2.12 

-0.77 

n  -3 
+0.89 

-0.70 

+1.  13 

+1.64 

+  1.38 

+1.43 

+  1.99 

+  1.70 

+  1.72 

+2.32 

+2.01 

+2.01 

+2.63 

+2.31 

+2.27 

+2.93 

+2.59 

+2.53 

+3.21 

+2.86 

+2.76 

+3.47 

+3.10 

+2.98 

+3.71 

+3.33 

+3.25 

+3.95 

+3.59 

+3.53 

+4.24 

+4.04 

-0.72 

n  =4 
+0  05 

+0.72 

+0.86 

+  1.36 

+  1.29 

+1.22 

+  1.77 

+  1.48 

+  1.56 

+2. 17 

+2.09 

+  1.89 

+2.54 

+2.23 

+2.28 

+2.94 

+2.80 

+2.70 

+3.40 

+3.16 

+3.09 

+3.85 

+3.68 

+3.47 

+4.27 

+4.42 

+3.92 

+4.72 

+4.54 

+4.36 

+5.21 

+6.27 

+0.99 

n.  =  5 
+  1.49 

+  1.50 

+  1.54 

+2. 13 

+  1.83 

+2.13 

+2.81 

+2.45 

+2.73 

+3.49 

+3.  10 

+3.33 

+4. 17 

+3.74 

+3.92 

+4.83 

+4.36 

+4.47 

+  5.45 

+5.45 

+5. 15 

+6.  17 

+6.57 

+5.86 

+6.93 

+7.91 

+1.79 

n..=6 

+2.41 

+2.11 

+2.50 

+3.22 

+2.95 

+3.25 

+4.08 

+3.73 

+4.02 

+4.95 

+4. 53 

+4.77 

+  5.80 

+  5.41 

+5.66 

+6.76 

+6.85 

+6.63 

+7.85 

+8.91 

+2.97 

n  =7 
+  3.74 

+  3.62 

+3.91 

+4.82 

+4.47 

+  5.00 

+6.06 

+  5.  52 

+6.33 

+7.  56 

+8.06 

+  7.73 

+9.  12 

+9.90 

+4.71 

n  =8 
+  5.72 

+6.49 

+6.  11 

+  7.30 

+  7.  15 

+7.80 

+9.21 

+10.08 

+7.48 

"1=9 

+9.02 

+10.87 

-0 

81 

-0 

68 

-0 

73 

-0 

80 

-0 

69 

-0 

72 

-0 

83 

-0 

72 

-0 

74 

-0 

.84 

-0 

77 

-0 

.73 

-0 

84 

-0 

80 

+0 

81 

-0 

83 

-0 

81 

-0 

75 

-0 

83 

-0 

80 

+  1 

10 

-0 

85 

-0 

78 

+0 

96 

-0 

88 

-0 

77 

+1 

39 

-0 

92 

-0 

81 

+1 

44 

-0 

95 

-0 

86 

+1 

74 

-0 

80 

+  1 

03 

+0 

80 

+  1 

30 

+  1 

82 

+1 

55 

+  1 

63 

+2 

18 

+1 

90 

+  1 

94 

+2 

53 

+2 

23 

+2 

23 

+2 

86 

+2 

54 

+2 

51 

+3 

17 

+2 

83 

+2 

77 

+3 

45 

+3 

10 

+3 

01 

+3 

71 

+3 

35 

+3 

24 

+3 

95 

+3 

59 

+3 

45 

+4 

18 

+3 

81 

+3 

65 

+4 

38 

+4 

01 

-0 

79 

+  1 

09 

+0 

85 

+0 

99 

+  1 

54 

+  1 

43 

+  1 

37 

+  1 

98 

+  1 

67 

+  1 

75 

+2 

41 

+2 

30 

+2 

11 

+2 

83 

+2 

46 

+2 

46 

+3 

22 

+3 

11 

+2 

79 

+3 

60 

+3 

18 

+3 

17 

+4 

00 

+3 

82 

+3 

55 

+4 

42 

+3 

97 

+3 

91 

+4 

82 

+4 

47 

+4 

25 

+  5 

19 

+4 

71 

+  1 

01 

+  1 

56 

+  1 

56 

+  1 

51 

+2 

16 

+  1 

82 

+2 

04 

+2. 

77 

+2 

39 

+2 

56 

+3 

38 

+2 

99 

+3 

08 

+3 

98 

+3 

51 

+3 

57 

+4 

55 

+4 

04 

+4 

04 

+5 

08 

+4 

54 

+4 

48 

+  5 

58 

+  5 

01 

+4 

90 

+6. 

05 

+5 

45 

+  1 

58 

+2 

24 

+  1 

95 

+2 

13 

+2 

89 

+  2 

+2 

67 

+3 

54 

+3 

16 

+3 

21 

+4 

16 

+3 

66 

+3 

71 

+4. 

75 

+4 

30 

+4 

18 

+  5 

29 

+4 

72 

+4 

62 

+5 

80 

+  5 

28 

+2 

20 

+2. 

98 

+2 

57 

+2 

76 

+3 

65 

+3 

29 

+3 

36 

+4. 

35 

+3 

84 

+  3 

99 

+  5. 

10 

+4 

53 

+4. 

57 

+  5, 

79 

+5. 

16 

+2. 

82 

+3. 

73 

+  3. 

38 

+  3. 

47 

+4. 

50 

+  3. 

96 

+4. 

08 

+  5. 

24 

+4. 

63 

+  3, 

48 

+  4. 

54 

+3. 

99 
Table 2: No Ties


Signed (Max | Estimated-Actual | ) 

«o  “rVd  "o 

E,=E  . 

1  gcd 

signed (Max ( 
«0 

'  (Estimated-Actual ( /Actual) ) 

Ni=N  .  E„  E,=E 

1  gcd  0  1  g 

8 

-.0516 

-.0046 

-.0442 

-.oo3r' 

-0.363 

-0.065 

-0.447 

-0.  10 

9 

-.0395 

-.0173 

-.0348 

-.0014 

-0. 355 

-0. 155 

-0.390 

-0.04 

10 

-.0375 

-.0134 

-.0276 

-.0083 

-0.341 

-0. 151 

-0.328 

-0.09 

11 

-.0304 

-.0117 

-.0250 

-.0067 

-0.320 

+0.241 

-0.303 

-0.09 

12 

-.0377 

-.0096 

-.0209 

-.0052 

-0.293 

+0.362 

-0.290 

+0.  15 

13 

-.0318 

-.0156 

-.0258 

-.0038 

-0.276 

+0.490 

-0.274 

+0.23 

14 

-.0268 

-.0134 

-.0224 

-.0072 

+0.295 

+0.627 

-0.255 

+0.32 

15 

-.0272 

-.0113 

- . 0208 

-.0065 

+0.431 

+0. 772 

-0.233 

+0.41 

16 

-.0235 

-.0114 

-.0184 

-.0057 

+0.573 

+0.924 

-0.222 

+0.  50 

17 

-.0278 

-.0099 

-.0211 

-.0048 

+0.724 

+1.084 

-0.216 

+0.60 

18 

-.0246 

-.0136 

-.0191 

-.0068 

+0.881 

+  1.252 

+0.307 

+0.70 

8 

-.0351 

-.0034 

-.0296 

-.oo2r' 

-0.293 

-0.048 

-0.463 

-0.  11 

9 

-.0326 

-.0055 

-.0204 

-.0015 

-0.272 

+0. 184 

-0.405 

-0.05 

10 

-.0224 

- . 0059 

-.0194 

-.0015 

-0.251 

+0.359 

-0.341 

+0.06 

11 

-.0204 

-.0047 

-.0168 

-.0014 

-0.231 

+0. 560 

-0.271 

+0.  11 

12 

-.0218 

-.0068 

-.0176 

+.0015 

+0.381 

+0.785 

-0.232 

+0.  18 

13 

-.0202 

-.0050 

-.0157 

-.0016 

+0.606 

+1.037 

-0.217 

+0.28 

14 

-.0198 

- . 0046 

-.0149 

+.0016 

+0.856 

+  1.317 

-0.200 

+0.38 

15 

-.0195 

-.0050 

-.0124 

+.0016 

+  1. 133 

+  1.625 

-0. 187 

+0.49 

16 

-.0158 

-.0054 

-.0121 

+  .0016 

+  1.437 

+  1.963 

-0. 174 

+0.61 

17 

-.0156 

-.0056 

-.0116 

+.0015 

+1.771 

+2.333 

+0.258 

+0.73 

18 

-.0152 

-.0055 

-.0109 

+.0015 

+2.136 

+2.734 

+0.363 

+0.86 

8 

-.0255 

-.0030 

-.0219 

+  .002^' 

-0.271 

+0.063 

-0.471 

-0.12 

9 

-.0244 

-.0063 

-.0215 

-.0018 

-0.256 

+0.258 

-0.421 

-0.07 

10 

-.0179 

+.0031 

-.0155 

+.0010 

-0.229 

+0.492 

-0.365 

+0.07 

11 

-.0222 

-.0036 

-.0180 

+.0008 

+0.345 

+0.771 

-0.306 

+0. 11 

12 

-.0200 

+  .0028 

-.0162 

+.0007 

+0.628 

+1.098 

-0.246 

+0.  17 

13 

-.0168 

-.0045 

-.0136 

+.0007 

+0.959 

+1.477 

-0.204 

+0.23 

14 

-.0153 

-.0044 

-.0123 

+.0007 

+  1.341 

+  1.913 

-0. 188 

+0.31 

15 

-.0170 

-.0034 

-.0130 

+.0007 

+  1.781 

+2  413 

-0. 169 

+0.39 

16 

-.0149 

-.0036 

-.0113 

+.0006 

+2.285 

+2.981 

-0. 152 

+0.47 

17 

-.0137 

- . 0044 

-.0104 

+ . 0006 

+2.857 

+3.624 

-0. 140 

+0.55 

18 

-.0124 

-.0039 

-.0092 

+.0006 

m  >.c 

+3.504 

+4.348 

+0.153 

+0.62 

10 

-.0238 

-.0036 

-.0200 

+.ooir 

-0.223 

+0.535 

-0.375 

+0.06 

11 

-.0167 

-.0033 

-.0142 

+ . 0009 

+0.425 

+0.874 

-0.331 

+0.  10 

12 

-.0173 

-.0041 

-.0142 

+.0007 

+0.775 

+1.283 

-0.293 

+0.  15 

13 

-.0173 

-.0033 

-.0137 

+.0006 

+1. 198 

+  1. 774 

-0.264 

+0.20 

I/! 

-.014’ 

-.0037 

-  0115 

+ . 0004 

+  1.703 

+2.357 

-0.249 

+0.24 

15 

-.0143 

-.0040 

-.0112 

+  .0004 

+2 . 303 

■fj .  046 

-0.254 

+0.28 

16 

-.0143 

-.0032 

-.0109 

+.0004 

+3.012 

+3.854 

-0.286 

+0.31 

17 

-.0120 

-.0035 

-.0093 

+.0004 

+3.844 

+4.797 

-0.350 

+0.32 

18 

-.0121 

-.0029 

-.0910 

+.0003 

+4.814 

+5.892 

-0.454 

+0.32 

12 

-.0151 

-.0031 

-.0126 

+,ooSr" 

+0.824 

+  1.345 

-0.311 

+0.  14 

IJ 

-.0179 

-.0030 

-.0114 

+ .  0006 

+1.317 

+  1.922 

-0. 313 

+0.  17 

14 

-.0159 

-.0030 

-.0105 

+.0005 

+1.922 

+2.626 

-0.346 

+0.  19 

15 

-.0145 

-.0028 

-.0096 

+ .  0004 

+2.661 

+3.4/9 

-u.  4z,3 

+C.  19 

16 

-.0134 

-.0027 

-.0103 

+.0003 

+3.557 

+4. 508 

-0. 559 

+0,  16 

17 

-.0124 

-.0027 

- . 0094 

+ .  0003 

+4.639 

+5.742 

-0.769 

-0.28 

18 

-.0116 

-.0026 

-.0088 

+.0003 

+  5.935 

+  7.212 

-1.000 

-0.57 

14 

-.0145 

-.0025 

-.0116 

+.oooi 

+  1.995 

+2.715 

-0.382 

+0.  17 

15 

-.0122 

-.0030 

- . 0099 

+.0003 

+2.841 

+3.698 

-0. 520 

+0.  13 

16 

-.0124 

-.0025 

-.0097 

+ .  0003 

+3.892 

+4.909 

-0.753 

-0.30 

17 

-.0107 

-.0028 

- . 0084 

+ .  0003 

+5. 188 

+6.394 

-1.000 

-0.65 

18 

-.0109 

-.0024 

-.0083 

+ .  0002 

+6.779 

+8.205 

-1.000 

-1.00 

16 

-.0114 

-.0028 

- . 0090 

+.ooSr® 

+4 . 004 

+5.044 

-0.822 

-0.37 

17 

-.0108 

-.0028 

-.0085 

+  .0002 

+5. 468 

+6.726 

-1.000 

-0.85 

18 

-.0102 

-.0027 

-.0079 

+ .  0002 

+7.300 

+8.818 

-1.000 

-1.00 

18 

- . 0096 

-.0025 

-.0075 

II 

=  O 

o 

o 

+7.476 

+9.025 

-1.000 

-1.00 
ABSTRACT

Spearman's Footrule, D, is the sum of the absolute values of the
differences between the ranks in two rankings of n objects. For the
case of equally likely permutations, tables of the exact cumulative
distribution function (c.d.f.) of D are given for $11 \le n \le 18$.
For both Spearman's Footrule and the Rank Correlation Coefficient, the
maximum difference between the exact c.d.f. and the normal
approximation is given, as well as the maximum difference between the
exact c.d.f. and the normal approximation with correction for
continuity, and the two are compared.

1. Introduction

Given two rankings of n objects or, equivalently, two permutations p
and q, a widely used non-parametric measure of association between the
rankings is Spearman's $\rho$, given in unnormalized form as S, where

$$S(p,q) = \sum_{i=1}^{n} (p_i - q_i)^2. \tag{1.1}$$

S is related to Spearman's Rank Correlation Coefficient by
$\rho = 1 - 6S/(n^3 - n)$, and its derivation along with many of its
properties and moments can be found in Kendall (1970). In particular,
it is shown there that S is distributed from 0 to $(n^3 - n)/3$ on the
even integers with a mean of $(n^3 - n)/6$ and a variance of
$(n^3 - n)^2 / (36(n - 1))$.

An equally simple but neglected competitor is Spearman's (1904)
footrule, given by D, where

$$D(p,q) = \sum_{i=1}^{n} |p_i - q_i|. \tag{1.2}$$

The footrule has historically not been an important measure of
association because of a lack of desirable statistical properties (cf.
Pearson, 1907 and Kendall, 1970). Diaconis and Graham (1977) recently
revived interest in D by treating it as a metric on the set of
permutations, establishing its limiting normality by use of
Hoeffding's (1951) combinatorial central limit theorem, and showing it
to be related to Kendall's $\tau$, given by T, where

$$T(p,q) = \sum_{i=1}^{n-1} \sum_{j=i+1}^{n}
  \operatorname{sign}(p_i - p_j)\,\operatorname{sign}(q_i - q_j). \tag{1.3}$$

They were able to show $T \le D \le 2T$ and concluded that S is
probably the better metric, with D and T roughly the same. While D is
somewhat easier to interpret directly, T had the advantage of having
its exact c.d.f. tabulated for small sample sizes (cf. Kendall, 1970).
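
As a small illustration of definitions (1.1) through (1.3), the
following Python sketch computes S, D, $\rho$ and T for two rankings
given as sequences of ranks. The function names are illustrative
assumptions and do not come from the paper; T is computed exactly as
written in (1.3).

```python
def spearman_s(p, q):
    """Unnormalized Spearman statistic S of (1.1): sum of squared rank differences."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def footrule_d(p, q):
    """Spearman's footrule D of (1.2): sum of absolute rank differences."""
    return sum(abs(a - b) for a, b in zip(p, q))


def spearman_rho(p, q):
    """Rank correlation coefficient rho = 1 - 6S/(n^3 - n)."""
    n = len(p)
    return 1 - 6 * spearman_s(p, q) / (n ** 3 - n)


def _sign(x):
    return (x > 0) - (x < 0)


def kendall_t(p, q):
    """Double sum of sign products over pairs i < j, as written in (1.3)."""
    n = len(p)
    return sum(
        _sign(p[i] - p[j]) * _sign(q[i] - q[j])
        for i in range(n - 1)
        for j in range(i + 1, n)
    )


if __name__ == "__main__":
    p = [1, 2, 3, 4]
    q = [4, 3, 2, 1]  # complete reversal of the base ranking
    print(spearman_s(p, q), footrule_d(p, q), spearman_rho(p, q), kendall_t(p, q))
```

For the reversed ranking with n = 4, S and D reach their upper limits
$(n^3 - n)/3 = 20$ and $n^2/2 = 8$, and $\rho = -1$.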


Ury and Kleinecke (1979) tabulated the exact c.d.f. of D for
n = 2(1)10 and gave an approximate table of the c.d.f. of D for
n = 11(1)15 generated by Monte Carlo approximation. They also
conjectured about the rate of convergence to an approximating normal
distribution and that an improvement in the approximation could be
accomplished by applying a standard half-interval continuity
correction of +1 to the c.d.f. of D. Diaconis and Graham (1977)
calculated the asymptotic mean and variance of D and showed that
$0 \le D \le n^2/2$ for n even and $0 \le D \le (n^2 - 1)/2$ for n odd,
on the even integers. Spearman (1904) and Kleinecke, Ury and Wagner
(1962) derived the exact mean and variance of D, given by

$$E(D) = (n^2 - 1)/3 \tag{1.4}$$

and

$$\operatorname{Var}(D) = (n + 1)(2n^2 + 7)/45. \tag{1.5}$$
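
Formulas (1.4) and (1.5) can be checked by direct enumeration for
small n. The sketch below is only an illustration (the helper name
`footrule_moments` is an assumption, not the author's program); it
uses exact rational arithmetic so the comparison involves no rounding.

```python
from fractions import Fraction
from itertools import permutations


def footrule_moments(n):
    """Exact mean and variance of D over all n! equally likely permutations."""
    base = range(1, n + 1)
    values = [
        sum(abs(p - i) for p, i in zip(perm, base))
        for perm in permutations(base)
    ]
    count = len(values)
    mean = Fraction(sum(values), count)
    var = Fraction(sum(v * v for v in values), count) - mean ** 2
    return mean, var


if __name__ == "__main__":
    for n in range(2, 8):
        mean, var = footrule_moments(n)
        # Compare with the closed forms (1.4) and (1.5).
        assert mean == Fraction(n * n - 1, 3)
        assert var == Fraction((n + 1) * (2 * n * n + 7), 45)
        print(n, mean, var)
```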

2. Calculation of Exact Tables of the Null Distribution of S and D
for n = 11(1)18

Assuming the null distribution of S or D means that all n! possible
permutations are equally likely. By generating each possible
permutation and comparing it to a base ranking 1,2,3,...,n, the
corresponding S or D can be calculated and counted, yielding the
frequency distribution of either. Dividing each frequency by n! gives
the corresponding probability density function for S or D, and summing
yields the exact c.d.f. While conceptually easy, a straightforward
approach for calculating the exact c.d.f. when n = 17 requires
355,687,428,096,000 permutations to be calculated, each involving the
sum of 17 absolute differences for D or the sum of 17 squared
differences for S. This would involve staggering amounts of computer
time. However, Table 1, which displays the exact c.d.f. of D for n up
to 18, was calculated utilizing permutations, combinations, and stored
arrays as outlined by Franklin (1987). In that paper the exact c.d.f.
of S was presented for n = 12(1)16 using the same method. A later
paper by Franklin (1987) extended the table for S to n = 17 and 18,
again using this technique.

In this technique, first k was chosen (approximately n/2), and then
all the k! and (n-k)! permutations of {1,2,...,k} and {1,2,...,n-k}
were stored in matrices A and B, respectively. Next, a k-sized
combination of integers from {1,2,...,n} and its resulting combination
of (n-k) integers (the "remainder") was determined. All possible k!
permutations of the k-sized combination were formed, with each
permutation compared to the base ranking 1,2,...,k and the
corresponding sum of absolute differences calculated and stored in a
matrix $S_1$. Then all possible (n-k)! permutations of the "remainder"
were found, with each permutation compared to the base ranking
k+1,k+2,...,n and the corresponding sum of absolute differences
calculated and stored in a matrix $S_2$. Multiplying the counts of
$S_1(i)$ by the counts of $S_2(j)$ and placing the resulting count in
$S(i+j)$ results in the equivalent of processing $k! \times (n-k)!$
permutations for S. Finally, summation over all $\binom{n}{k}$
combinations gives the complete n! permutations for the null
distribution. (Copies of the program are available from the author.)
All calculations were done in quadruple precision, allowing accuracy
of 1 x 10^[exponent illegible], on a Harris 1000 "Super Mini"
computer. Internal checks on the number of permutations and external,
theoretical checks on the cumulative distribution assured complete
accuracy. Use of this technique allows a decrease of several orders of
magnitude in calculation time.
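
The split-and-combine idea described above can be sketched in Python
for small n. This is a minimal illustration under stated assumptions,
not the author's program: the function names are hypothetical,
`itertools` regenerates the permutations instead of reusing stored
matrices A and B, and Python integers replace quadruple precision. For
each k-sized combination only k! + (n-k)! permutations are enumerated,
and the two partial frequency tables are combined by multiplying
counts.

```python
from collections import Counter
from itertools import combinations, permutations
from math import factorial


def partial_counts(values, positions):
    """Frequency of sum(|value - position|) over all arrangements of `values`."""
    counts = Counter()
    for perm in permutations(values):
        counts[sum(abs(v - p) for v, p in zip(perm, positions))] += 1
    return counts


def footrule_counts_split(n, k=None):
    """Frequency distribution of D built from k-sized combinations."""
    k = n // 2 if k is None else k
    first = range(1, k + 1)        # base ranking 1..k
    second = range(k + 1, n + 1)   # base ranking k+1..n
    total = Counter()
    for combo in combinations(range(1, n + 1), k):
        rest = [v for v in range(1, n + 1) if v not in combo]
        c1 = partial_counts(combo, first)
        c2 = partial_counts(rest, second)
        # Multiply the partial counts and accumulate at the summed value of D.
        for i, a in c1.items():
            for j, b in c2.items():
                total[i + j] += a * b
    return total


def footrule_counts_brute(n):
    """Reference distribution obtained by enumerating all n! permutations."""
    return partial_counts(range(1, n + 1), range(1, n + 1))


if __name__ == "__main__":
    n = 7
    split = footrule_counts_split(n)
    assert split == footrule_counts_brute(n)
    assert sum(split.values()) == factorial(n)
    running = 0
    for d in sorted(split):
        running += split[d]
        print(d, running / factorial(n))
```

The check that the counts sum to n! mirrors the internal check on the
number of permutations mentioned in the text.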

Comparison of the exact distribution of Table 1 with the table
produced by Monte Carlo simulation by Ury and Kleinecke (1979) shows
remarkable accuracy. The largest difference was .003, with most
entries differing by .001 or less for $n \le 15$.

3. Normal Approximations to the Exact Distribution of S and D

In order to determine the precise degree of convergence in
distribution of S and D to normality, two approximations to the exact
c.d.f. of both were investigated by numerical integration: the normal
approximation and the normal approximation with continuity correction
(C.C.). Table 2 displays the maximum
|c.d.f. exact - c.d.f. approx.| over all possible values of S and D
for both approximations for n = 9 through 18.
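
The two approximations described above can be sketched for D as
follows. This is an illustrative assumption of how the comparison
might be set up, not the author's program: the exact c.d.f. is
obtained by brute-force enumeration (practical only for small n), the
normal c.d.f. is evaluated with `math.erf` rather than numerical
integration, and the continuity correction adds 1 to d because D takes
only even values.

```python
from itertools import permutations
from math import erf, factorial, sqrt


def exact_cdf(n):
    """Exact c.d.f. of D as a dict {d: P(D <= d)}, by full enumeration."""
    counts = {}
    for perm in permutations(range(1, n + 1)):
        d = sum(abs(v - i) for i, v in enumerate(perm, start=1))
        counts[d] = counts.get(d, 0) + 1
    total = factorial(n)
    cdf, running = {}, 0
    for d in sorted(counts):
        running += counts[d]
        cdf[d] = running / total
    return cdf


def phi(z):
    """Standard normal c.d.f."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


def max_errors(n):
    """Maximum absolute error of the plain and continuity-corrected approximations."""
    mu = (n * n - 1) / 3                               # equation (1.4)
    sigma = sqrt((n + 1) * (2 * n * n + 7) / 45)       # equation (1.5)
    cdf = exact_cdf(n)
    plain = max(abs(F - phi((d - mu) / sigma)) for d, F in cdf.items())
    corrected = max(abs(F - phi((d + 1 - mu) / sigma)) for d, F in cdf.items())
    return plain, corrected


if __name__ == "__main__":
    for n in (6, 7, 8):
        plain, corrected = max_errors(n)
        print(n, round(plain, 3), round(corrected, 3))
```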

Ury and Kleinecke (1979) stated that "despite the asymmetry of D, the
tendency toward the normal distribution is rather fast (much faster
than R)," where R is Spearman's Rho, and that "the approximation is
quite good for n = 10." However, examination of Table 2 shows the
contrary. For the uncorrected normal approximation for n = 10, the
error can be as large as .046 and is over .025 for $22 \le D \le 44$.
These values of D occur with probability over .93. Even for n = 18,
the uncorrected normal approximation can have a maximum absolute error
as large as .019. However, substantial improvement is obtained by
using the normal approximation with C.C., with a maximum absolute
error of .009 for n = 18.

Furthermore, while for D and S the maximum absolute error decreases
monotonically as n increases for both approximations, the convergence
of D is rather slow compared to S. Comparing the maximum absolute
errors presented in Table 2 for corresponding n shows the normal
approximation of S to have less than half the error of the normal
approximation of D (.046 for D versus .018 for S at n = 10 and .019
for D versus .008 for S at n = 18). Comparing the maximum absolute
error for corresponding n when the normal approximations have the
continuity correction factor shows S to have about 2/3 the error of D
(.015 for D versus .012 for S at n = 10 and .009 for D versus .006 for
S when n = 18). The extra error of D seems to be attributable to the
asymmetry of the distribution of the footrule, which seems
significantly present even for n = 18.


4. Conclusion and Recommendations

The exact c.d.f. of Spearman's footrule and Spearman's Rank
Correlation Coefficient should be used for $n \le 18$, since they now
exist. For $n \ge 19$, the straight normal approximation for D should
be avoided in favor of the normal approximation with continuity
correction factor. For all $n \ge 19$, such an approximation will have
an error of < .006 for any value of D and will have even smaller error
(approximately .003) for most upper and lower tail values. Convergence
to normality of Spearman's footrule is significantly slower than the
convergence of Spearman's Rho.

Seven different approximations to S have been presented by Franklin
(1987). The clearly and dramatically superior approximation was shown
in that paper to be a Pearson Type II approximation. For further
discussion of that approximation see Olds (1938) and Zar (1972). For
that approximation the maximum absolute error is .000158 for n = 18.

REFERENCES

Diaconis, P. and Graham, R. L. (1977). Spearman's Footrule as a
Measure of Disarray. J. R. Statist. Soc. B, 39, 262-268.

Franklin, L. A. (1987). The Complete Exact Null Distribution of
Spearman's Rho for n = 12(1)16. Proceedings of the 19th Symposium on
the Interface Between Statistics and Computer Science, March 1987.

Franklin, L. A. (1987). Approximations, Convergence and Exact Tables
for Spearman's Rank Correlation Coefficient. Proceedings of the
Statistical Computation Section, American Statistical Association,
Fall, 1987.

Hoeffding, W. (1951). A Combinatorial Central Limit Theorem. Ann.
Math. Statist., 22, 558-566.

Kendall, M. G. (1970). Rank Correlation Methods, 4th ed. London:
Griffin.

Kleinecke, D. C., Ury, H. K., and Wagner, L. F. (1962). Spearman's
Footrule--an Alternative Rank Statistic. Civil Defense Research
Project, Institute of Engineering Research, University of California
Berkeley, Report No. CDRP-182-114, November 1962.

Olds, E. G. (1938). Distribution of Sums of Squares of Rank
Difference for Small Numbers of Individuals. Annals of Mathematical
Statistics, 9, 133-148.

Pearson, K. (1907). Mathematical Contributions to the Theory of
Evolution. XVI. On Further Methods of Determining Correlation.
Draper's Co. Res. Mem., Biometric Series IV. Cambridge University
Press.

Spearman, C. (1904). The Proof and Measurement of Association Between
Two Things. Amer. J. Psychol., 15, 72-101.

Spearman, C. (1906). "Footrule" for Measuring Correlation. Brit. J.
Psychol., 2, 89-108.

Ury, H. K. and Kleinecke, D. C. (1979). Tables of the Distribution of
Spearman's Footrule. Applied Statistics, 28, 271-275.

Zar, Jerrold H. (1972). Significance Testing of the Spearman's Rank
Correlation Coefficient. Journal of the American Statistical
Association, 67, 578-580.




Table 1

Exact cumulative null distribution of D
(An entry such as 0.25-7 denotes 0.25 x 10^-7.)

d/n       11       12       13       14       15       16       17       18
  0   0.25-7   0.21-8   0.16-9   0.1-10   0.8-12   0.5-13   0.3-14   0.2-15
  2   0.28-6   0.25-7   0.21-8   0.16-9   0.1-10   0.8-12   0.5-13   0.3-14
  4   0.19-5   0.18-6   0.16-7   0.13-8   0.10-9   0.7-11   0.5-12   0.3-13
  6   0.93-5   0.98-6   0.93-7   0.81-8   0.65-9   0.5-10   0.3-11   0.2-12
  8   0.38-4   0.43-5   0.44-6   0.40-7   0.34-8   0.27-9   0.2-10   0.1-11
 10   0.13-3   0.16-4   0.17-5   0.17-6   0.15-7   0.13-8   0.10-9   0.7-11
 12   0.39-3   0.52-4   0.61-5   0.63-6   0.60-7   0.53-8   0.42-9   0.3-10
 14   0.0011   0.15-3   0.19-4   0.21-5   0.21-6   0.19-7   0.16-8   0.13-9
 16   0.0026   0.39-3   0.53-4   0.63-5   0.67-6   0.65-7   0.58-8   0.47-9
 18   0.0056   0.94-3   0.14-3   0.17-4   0.19-5   0.20-6   0.19-7   0.16-8
 20   0.0113   0.0021   0.32-3   0.44-4   0.52-5   0.57-6   0.55-7   0.50-8
 22   0.0211   0.0042   0.70-3   0.10-3   0.13-4   0.15-5   0.15-6   0.14-7
 24   0.0368   0.0080   0.0014   0.23-3   0.31-4   0.37-5   0.40-6   0.40-7
 26   0.0606   0.0143   0.0028   0.47-3   0.67-4   0.86-5   0.99-6   0.10-6
 28   0.0946   0.0244   0.0051   0.92-3   0.14-3   0.19-4   0.23-5   0.25-6
 30   0.1403   0.0395   0.0090   0.0017   0.28-3   0.40-4   0.51-5   0.58-6
 32   0.1990   0.0611   0.0150   0.0030   0.53-3   0.80-4   0.11-4   0.13-5
 34   0.2700   0.0907   0.0240   0.0052   0.96-3   0.15-3   0.22-4   0.27-5
 36   0.3522   0.1295   0.0370   0.0086   0.0017   0.28-3   0.42-4   0.55-5
 38   0.4420   0.1782   0.0550   0.0137   0.0028   0.51-3   0.79-4   0.11-4
 40   0.5363   0.2369   0.0791   0.0210   0.0046   0.87-3   0.14-3   0.20-4
 42   0.6295   0.3049   0.1102   0.0314   0.0073   0.0015   0.25-3   0.38-4
 44   0.7180   0.3807   0.1490   0.0454   0.0113   0.0024   0.42-3   0.67-4
 46   0.7955   0.4619   0.1958   0.0640   0.0169   0.0037   0.70-3   0.12-3
 48   0.8617   0.5455   0.2506   0.0878   0.0246   0.0057   0.0011   0.20-3
 50   0.9120   0.6281   0.3127   0.1174   0.0350   0.0086   0.0018   0.32-3
 52   0.9498   0.7063   0.3808   0.1535   0.0486   0.0126   0.0027   0.52-3
 54   0.9740   0.7771   0.4531   0.1960   0.0660   0.0180   0.0041   0.81-3
 56   0.9895   0.8383   0.5277   0.2451   0.0877   0.0253   0.0061   0.0013
 58   0.9960   0.8888   0.6018   0.3001   0.1144   0.0348   0.0088   0.0019
 60   1.0000   0.9281   0.6735   0.3602   0.1462   0.0470   0.0125   0.0028
 62            0.9569   0.7400   0.4243   0.1834   0.0623   0.0173   0.0041
 64            0.9766   0.7999   0.4908   0.2259   0.0812   0.0237   0.0058
 66            0.9886   0.8515   0.5580   0.2736   0.1040   0.0319   0.0082
 68            0.9953   0.8946   0.6242   0.3258   0.1310   0.0422   0.0113
 70            0.9989   0.9284   0.6876   0.3819   0.1624   0.0550   0.0154
 72            1.0000   0.9545   0.7466   0.4408   0.1984   0.0706   0.0206
 74                     0.9725   0.7999   0.5013   0.2387   0.0894   0.0273
 76                     0.9850   0.8467   0.5623   0.2832   0.1115   0.0356
 78                     0.9925   0.8863   0.6222   0.3314   0.1373   0.0458
 80                     0.9971   0.9188   0.6799   0.3828   0.1168   0.0583
 82                     0.9989   0.9443   0.7340   0.4364   0.2001   0.0732
 84                     1.0000   0.9636   0.7838   0.4915   0.2371   0.0907
 86                              0.9776   0.8282   0.5471   0.2777   0.1112
 88                              0.9871   0.8670   0.6022   0.3214   0.1348
 90                              0.9932   0.8998   0.6556   0.3678   0.1615
 92                              0.9968   0.9269   0.7066   0.4164   0.1914
 94                              0.9987   0.9484   0.7542   0.4665   0.2246
 96                              0.9997   0.9650   0.7978   0.5174   0.2607
 98                              1.0000   0.9772   0.8369   0.5682   0.2997
100                                       0.9861   0.8712   0.6182   0.3413
102                                       0.9919   0.9006   0.6667   0.3849
104                                       0.9957   0.9252   0.7129   0.4302
106                                       0.9979   0.9452   0.7562   0.4766
108                                       0.9992   0.9611   0.7961   0.5234
110                                       0.9997   0.9733   0.8322   0.5702
112                                       1.0000   0.9824   0.8643   0.6163
114                                                0.9889   0.8923   0.6611
116                                                0.9934   0.9162   0.7040
118                                                0.9963   0.9362   0.7445
120                                                0.9981   0.9526   0.7822
122                                                0.9991   0.9656   0.8168
124                                                0.9996   0.9758   0.8481
126                                                0.9999   0.9835   0.8759
128                                                1.0000   0.9892   0.9002
130                                                         0.9932   0.9211
132                                                         0.9959   0.9388
134                                                         0.9977   0.9535
136                                                         0.9988   0.9654
138                                                         0.9994   0.9748
140                                                         0.9997   0.9821
142                                                         0.9999   0.9877
144                                                         1.0000   0.9918
146                                                                  0.9947
148                                                                  0.9968
150                                                                  0.9981
152                                                                  0.9990
154                                                                  0.9995
156                                                                  0.9997
158                                                                  0.9999
160                                                                  0.9999
162                                                                  1.0000

Table 2

Maximum |c.d.f. exact - c.d.f. approx.| Over All Possible Values of D

n                        10    11    12    13    14    15    16    17    18
Normal Approximation   .046  .040  .035  .031  .028  .025  .023  .021  .019
  at D =                 30    36    42    50    58    68    76    86    96
Normal Approximation
  with C.C.            .015  .014  .013  .012  .011  .011  .010  .010  .009
  at D =                 34    42    50    58    68    78    88   100   112
THE EFFECTS OF HEAVY TAILED DISTRIBUTIONS ON THE TWO-SIDED k-SAMPLE SMIRNOV TEST


Henry  D.  Crockett  and  M.  M.  Whiteside,  University  of  Texas  at  Arlington 


Abstract: This paper presents the problem that the k-sample Smirnov
test has in discriminating the ranking of samples from heavy tailed
probability distribution functions. The results of 1000 tests are
presented for each of seven levels of variance and five scalar offsets
for the two distributions.

1.  Introduction.  Given  nk  random  variables 
{X..},  i=l , . . ,n, j=l , . . ,k,  which  represent  k 
raniiom  samples  of  equal  size  n  from  an 
absolutely  continuous  distribution  function 
F.(x).  The  k-sample  Smirnov  test  is  used  to 
d^te  rmine  if  the  population  distribution 
functions  F(x)  are  identical.  Thus  the 
hypotheses  would  be: 

Hq  ;Fj(x)  =  F^Cx)  =  ...  =  Fj^(x)  for  all  x, 

Hj  :F^(x)  =  for  some  t,  u,  and  x. 

In  order  to  perform  the  test,  the  sample  must  be 
ordered  within  themselves,  so  that  the  rth 
ordered  sample  is  Zj  <  ,..<Z  Then  Z.  is 

the  ith  order  statistic  from  the^rth  sample^  "^and 
Z  is  the  extreme  of  the  rth  sample.  Thus  the 
empirical  distribution  function  of  the  rth 
sample  would  be: 


F^(x)  =  0 
F^(x)  =  a/n 
F^(x)  =  1 


i  f  X  <  Z .  , 

ir 


i  f  L  <=  X  <:=  A 

ar  a-H,r, 

if  Z  <=  X. 
nr 


The samples are then ordered among themselves on
the basis of their most extreme points. Let S be
the set of extremes from the k samples, so that
$S = \{Z_{nr},\ r = 1, \dots, k\}$. The set S is then
ordered to determine the smallest and the largest
$Z_{nr}$. The sample related to the smallest $Z_{nr}$
is then the sample of rank 1, with empirical
distribution function $F^{(1)}(x)$, and the sample
related to the largest $Z_{nr}$ is the sample of
rank k, with empirical distribution function
$F^{(k)}(x)$. The test statistic $T_1$ is then defined
by Conover (1980) as the maximum vertical distance
between $F^{(1)}(x)$ and $F^{(k)}(x)$. Mathematically
this is stated as:

$$T_1 = \sup_x \left( F^{(1)}(x) - F^{(k)}(x) \right).$$

This test statistic would then be compared to
a table value to obtain a decision for a given
level of α.
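
As an illustration of the definition above, the following is a minimal sketch of the statistic in Python. The function names `ecdf` and `smirnov_k_sample` are hypothetical and are not taken from the paper or from SAS. The sketch only computes the statistic; the comparison with Conover's table values is not reproduced.

```python
from bisect import bisect_right


def ecdf(sample):
    """Return the empirical distribution function of one sample."""
    ordered = sorted(sample)
    n = len(ordered)

    def F(x):
        # Fraction of observations less than or equal to x.
        return bisect_right(ordered, x) / n

    return F


def smirnov_k_sample(samples):
    """sup_x (F1(x) - Fk(x)), where F1 and Fk are the EDFs of the samples
    with the smallest and the largest maximum, respectively."""
    by_extreme = sorted(samples, key=max)      # rank samples by their extremes
    low, high = by_extreme[0], by_extreme[-1]
    f_low, f_high = ecdf(low), ecdf(high)
    # Both EDFs are right-continuous step functions, so the supremum of
    # their difference is attained at one of the observed points.
    points = sorted(set(low) | set(high))
    return max(f_low(x) - f_high(x) for x in points)


if __name__ == "__main__":
    samples = [[1.0, 2.0, 3.0], [2.5, 3.5, 4.5], [4.0, 5.0, 6.0]]
    print(smirnov_k_sample(samples))
```

Note that only the two samples with the most extreme maxima enter the statistic, which is why a single outlying observation can change which pair of samples is compared.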

2. Effect of Heavy Tailed Distributions.

The method for choosing the largest and smallest
sample for comparison appears to be susceptible
to error when choosing among heavy tailed
distributions. These distributions are more
likely to have extreme values due to the nature
of their probability distribution functions
(p.d.f.s) than a more leptokurtic p.d.f. This
would indicate that if k samples were drawn from
populations with the same p.d.f.s, of which one
differed from the others only by some scalar
factor, the p.d.f.s with heavier tailed
distributions would choose the true "largest"
sample less frequently than the p.d.f.s with
smaller tails. In this way, extreme values in
the samples could greatly affect which one of
the samples was chosen as the "largest" sample
(Fig. 1). Therefore, since the true "largest"
sample is determined less often, the test
statistic is comparing two samples from truly
equivalent distributions; therefore, the
probability of failing to reject should equal
1-α, when in fact the null hypothesis is false.

In order to show how these differences affect
the two-sided k-sample Smirnov test, a
simulation was performed. The two p.d.f.s which
were compared were the uniform distribution and
the double exponential distribution. The
simulations were performed on sample sizes of
6, 12, and 30. For each distribution and each
sample size, several levels of variance and
scalar factor movement of one of the three
samples were considered. The levels of variance
were 10, 25, 50, 75, 100, 150, and 200. The
scalar factors added to one of the samples drawn
were 0, 1, 2, 4, and 8.

Random number generation and test results for
1000 tests for each level of sample size,
variance, and factor were performed using the
Statistical Applications System (SAS). The null
hypothesis is that the population distribution
functions are identical, tested at the α=0.05
level of significance.
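
The part of this design that concerns the choice of the "largest" sample can be sketched as follows. This is not the authors' SAS program: it is a minimal Python sketch under stated assumptions, with hypothetical names (`draw`, `fraction_shifted_is_largest`), Python's default generator instead of the SAS generators, and the two distributions parameterised by their variance (half-width sqrt(3V) for the uniform, scale sqrt(V/2) for the double exponential).

```python
import math
import random


def draw(dist, variance, n, rng):
    """Draw n zero-centred values with the requested variance."""
    if dist == "uniform":
        half_width = math.sqrt(3.0 * variance)   # Var U(-a, a) = a^2 / 3
        return [rng.uniform(-half_width, half_width) for _ in range(n)]
    if dist == "double_exponential":
        scale = math.sqrt(variance / 2.0)        # Var Laplace(b) = 2 b^2
        # The difference of two independent unit exponentials is Laplace(0, 1).
        return [scale * (rng.expovariate(1.0) - rng.expovariate(1.0))
                for _ in range(n)]
    raise ValueError("unknown distribution: " + dist)


def fraction_shifted_is_largest(dist, variance, offset,
                                n=6, k=3, trials=1000, seed=0):
    """Proportion of trials in which the shifted sample has the largest maximum."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        samples = [draw(dist, variance, n, rng) for _ in range(k)]
        samples[0] = [x + offset for x in samples[0]]   # shift one sample only
        maxima = [max(s) for s in samples]
        if maxima[0] == max(maxima):
            hits += 1
    return hits / trials


if __name__ == "__main__":
    for dist in ("uniform", "double_exponential"):
        for offset in (0, 1, 2, 4, 8):
            print(dist, offset, fraction_shifted_is_largest(dist, 10, offset))
```

With no offset the shifted sample should be chosen about one time in three, since the k = 3 samples are then exchangeable.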

3. Simulation Results. The results of the
simulation were mixed. The expected results
would be that the Smirnov test would detect a
scalar factor change in one sample more often
for a uniform distribution, and would therefore
reject the null hypothesis more often than for
the samples from double exponentials. These tests
were all performed at the α=0.05 level and the
percentage of rejections was determined for
each level (Tables 1-6). Also for each level
the percentage of times the sample which had a
scalar factor change was chosen as the largest
sample is shown in parentheses. The results
appear to be conflicting because although for
each level of all factors the uniform
distribution correctly chose the true largest
sample more frequently than the double
exponential, it also appears that the rejection
rate was larger for the double exponential than
for uniform distributions.

Upon closer inspection of the distribution
functions, however, this result appears to be
justified. From Figures 2 and 3, it is apparent
that given two distribution functions of
uniform density, one of which has been adjusted
by a scalar, the distance between the two
distributions is much less than that between two
double exponentials that have been separated by
the same amount. Therefore, even though the correct
largest sample was chosen more often from the
samples of the uniform p.d.f.s, the decision
to reject the null hypothesis may not have been
made due to smaller values of $T_1$.
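
The comparison of distances can be checked directly at the population level. The sketch below is an illustration rather than part of the paper, and it assumes that both distributions are parameterised by their variance V and that the "distance" is the maximum vertical gap between a distribution function and its copy shifted by c. Under those assumptions the gap is min(1, c/(2 sqrt(3V))) for the uniform and 1 - exp(-c/(2 sqrt(V/2))) for the double exponential. The function names are hypothetical.

```python
import math


def uniform_gap(variance, offset):
    """Largest vertical gap between U(-a, a) and the same c.d.f. shifted by offset."""
    half_width = math.sqrt(3.0 * variance)
    return min(1.0, offset / (2.0 * half_width))


def double_exponential_gap(variance, offset):
    """Largest vertical gap between Laplace(0, b) and the same c.d.f. shifted by offset."""
    scale = math.sqrt(variance / 2.0)
    # The gap is the probability mass within offset/2 of the mode.
    return 1.0 - math.exp(-offset / (2.0 * scale))


if __name__ == "__main__":
    for variance in (10, 25, 50, 75, 100, 150, 200):
        for offset in (1, 2, 4, 8):
            print(variance, offset,
                  round(uniform_gap(variance, offset), 3),
                  round(double_exponential_gap(variance, offset), 3))
```

For every variance and nonzero offset used in the simulation, the double exponential gap is the larger of the two, which agrees with the explanation given for the higher rejection rates.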

4. Conclusions and Speculation. Since the
assumption that leptokurtic distributions will
correctly identify a similar distribution
function which has been adjusted by an offset

Figure 1


Rejection rates at the 0.05 level for k=3, n=6, by variance (rows)
and scalar offset (columns). The proportion of tests in which the
shifted sample was chosen as the largest sample is in parentheses.

UNIFORM DISTRIBUTION (k=3, n=6)

Variance      0        1        2        4        8
   10       0.044    0.055    0.104    0.323    0.951
           (0.347)  (0.646)  (0.826)  (0.962)  (1.000)
   25       0.038    0.047    0.066    0.128    0.468
           (0.331)  (0.562)  (0.688)  (0.883)  (0.996)
   50       0.036    0.043    0.063    0.085    0.268
           (0.325)  (0.489)  (0.598)  (0.779)  (0.950)
   75       0.035    0.039    0.031    0.072    0.205
           (0.342)  (0.453)  (0.581)  (0.724)  (0.929)
  100       0.041    0.036    0.036    0.058    0.151
           (0.334)  (0.431)  (0.538)  (0.679)  (0.874)
  150       0.044    0.042    0.058    0.063    0.130
           (0.308)  (0.457)  (0.485)  (0.659)  (0.809)
  200       0.034    0.041    0.039    0.048    0.080
           (0.332)  (0.417)  (0.487)  (0.612)  (0.788)

DOUBLE EXPONENTIAL (k=3, n=6)

Variance      0        1        2        4        8
   10       0.040    0.067    0.168    0.536    0.932
           (0.357)  (0.429)  (0.567)  (0.784)  (0.957)
   25       0.036    0.051    0.085    0.238    0.637
           (0.336)  (0.390)  (0.459)  (0.619)  (0.814)
   50       0.044    0.047    0.069    0.144    0.428
           (0.340)  (0.376)  (0.432)  (0.512)  (0.727)
   75       0.039    0.042    0.058    0.114    0.303
           (0.346)  (0.367)  (0.409)  (0.497)  (0.644)
  100       0.040    0.046    0.045    0.090    0.238
           (0.351)  (0.336)  (0.373)  (0.486)  (0.644)
  150       0.043    0.049    0.038    0.062    0.170
           (0.333)  (0.338)  (0.377)  (0.444)  (0.580)
  200       0.037    0.036    0.049    0.069    0.137
           (0.334)  (0.341)  (0.371)  (0.421)  (0.543)

SIMULATED  POWER  COMPARISONS  OF  MRPP  RANK  TESTS  AND  SOME  STANDARD  SCORE  TESTS 


Derrick  S.  Tracy  and  Khushnood  A.  Khan,  University  of  Windsor 


Abstract 

Two  MRPP  rank  tests  and  two  standard  score 
tests  -  median  and  Fisher,  are  compared  with 
respect  to  their  empirical  powers,  computed  from 
extensive  simulations  from  normal,  Cauchy  and 
Laplace  underlying  populations.  This  is  done  for 
several combinations of sample sizes - unequal
and equal.

In  applying  classical  linear  rank  tests  to 
meteorological  data,  Mielke,  Berry  and  Medina 
(1982)  posed  the  problem  that  in  most  cases  the 
analysis  space  associated  with  these  tests  is 
non-metric,  and  hence  the  p-values  may  not  be 
interpreted  correctly.  An  alternative  inference 
technique  known  as  multiresponse  permutation 
procedure  (MRPP)  is  proposed,  of  which  a 
generalized version is discussed in Mielke (1984).
Let $\Omega = \{\omega_1, \dots, \omega_N\}$ be a finite population of
N objects, each of which has r responses, and
the responses have the same range via a rank
order transformation. We let $K = \sum_{i=1}^{g} N_i$ of these
be classified into g mutually exclusive subgroups
according to some a priori classification scheme.
The excess N-K observations are in the (g+1)th
subgroup. Then the MRPP test statistic is
defined as

$$\delta = \sum_{i=1}^{g} C_i \xi_i, \qquad C_i > 0, \qquad \sum_{i=1}^{g} C_i = 1,$$

$$\xi_i = \binom{N_i}{2}^{-1} \sum_{I<J} \Delta_{IJ}\, S_i(\omega_I)\, S_i(\omega_J),$$

$S_i(\omega_I)$ being an indicator function, and $\Delta_{IJ}$
the symmetric distance function

$$\Delta_{IJ} = \Big( \sum_{k=1}^{r} |\omega_{kI} - \omega_{kJ}|^p \Big)^{v/p}, \qquad p \ge 1, \quad v > 0,$$

with $\omega_{kI}$ denoting the kth response of $\omega_I$.
(When r=1, p is irrelevant.) The analysis space
is non-metric for v>1 and metric for $v \le 1$. The
majority of the permutation tests used in practice
are based on v=2. The choice of p=2 and v=1 is
recommended.
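
For a single response (r = 1) the statistic reduces to a weighted average of within-group mean pairwise distances. The following is a minimal sketch under stated assumptions: univariate data, group weights C_i = N_i/K, and every group containing at least two observations. The function name `mrpp_delta` is hypothetical.

```python
from itertools import combinations


def mrpp_delta(groups, v=1.0):
    """MRPP statistic for univariate responses with weights C_i = N_i / K.

    groups: list of lists of numbers, each with at least two members.
    v: exponent of the distance |x_I - x_J| ** v.
    """
    total = sum(len(g) for g in groups)              # K
    delta = 0.0
    for g in groups:
        pairs = list(combinations(g, 2))             # all I < J within the group
        xi = sum(abs(a - b) ** v for a, b in pairs) / len(pairs)
        delta += (len(g) / total) * xi
    return delta


if __name__ == "__main__":
    data = [[1, 2, 3], [10, 12]]
    print(mrpp_delta(data, v=1.0))
    print(mrpp_delta(data, v=2.0))
```

Small values of the statistic indicate that observations within the same group lie close together relative to the pooled data.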

Under $H_0$ that the classification is random, equal
probabilities are assigned to the
$M = N! / \prod_{i=1}^{g+1} N_i!$
possible allocations of the N objects into the
subgroups. $H_0$ is rejected when $\delta$ is small. The
number M of possible allocations is very large,
even for moderate N, making it difficult to
obtain the exact distribution. This is overcome
by taking the approximate distribution, using
the first four moments of $\delta$, see Tracy and
Tajuddin (1985). An efficient choice of $C_i$ when
the $N_i$'s are not equal is $N_i/K$, as suggested by
Mielke (1984, p.817). Mielke, Berry, Brockwell
and Williams (1981) considered a special case
of $\delta$ when r=1 and measurements are replaced by
their ranks $R(\omega_I)$ in the combined sample. Then
$\Delta_{IJ} = |R(\omega_I) - R(\omega_J)|^v$, v>0. For v=1,2, they
denoted the test statistic by $\delta_v$. Using the
Pearson criterion, appropriate Pearson Type
curves are suggested by Tracy and Tajuddin
(1985), and power performance studied for equal
sample sizes by Tracy and Tajuddin (1986). For
$\delta_1$, Brockwell, Mielke and Robinson (1982) show
that $\delta_1$ has a non-normal non-invariant
distribution, and that its asymptotic distribution
depends on the underlying population.
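
To see why the exact distribution is impractical, the number of allocations M can be computed directly, and for very small samples the exact permutation p-value can be obtained by enumeration. The sketch below assumes two groups with no excess observations, the weights N_i/K, and untransformed univariate data; the function names are hypothetical, and it is not the Pearson-curve approximation used in the paper.

```python
import math
from itertools import combinations


def n_allocations(sizes):
    """M = N! / prod(N_i!) for the given subgroup sizes."""
    m = math.factorial(sum(sizes))
    for s in sizes:
        m //= math.factorial(s)
    return m


def delta_two_groups(a, b, v=1.0):
    """MRPP statistic for two univariate groups with weights N_i / K."""
    def xi(g):
        pairs = list(combinations(g, 2))
        return sum(abs(x - y) ** v for x, y in pairs) / len(pairs)

    k = len(a) + len(b)
    return len(a) / k * xi(a) + len(b) / k * xi(b)


def exact_p_value(a, b, v=1.0):
    """Proportion of allocations whose statistic is at most the observed one."""
    pooled = list(a) + list(b)
    observed = delta_two_groups(a, b, v)
    count = total = 0
    for idx in combinations(range(len(pooled)), len(a)):
        chosen = set(idx)
        first = [pooled[i] for i in idx]
        second = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        total += 1
        # Small tolerance guards against floating point ties.
        if delta_two_groups(first, second, v) <= observed + 1e-12:
            count += 1
    return count / total


if __name__ == "__main__":
    print(n_allocations([3, 2]))
    print(n_allocations([40, 40]))   # the N = 80 case, roughly 1.1e23
    print(exact_p_value([1, 2, 3], [10, 12]))
```

Enumeration is feasible only for a handful of observations; with N = 80 the count above shows why a moment-based approximation is used instead.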

In this paper, we compare the power performance
of $\delta_1$ and $\delta_2$ for $N_1$, $N_2$ unequal and equal
with N=80 when the underlying distributions are
normal, Cauchy and Laplace. Based on a simulation
study, 5000 samples are generated using IMSL
subroutines. The first $N_1$ observations are
shifted by $k\sigma$, where k proceeds from 0 to 1.0, at
steps of 0.2. To obtain empirical powers, the
number of rejections is counted for α = .001, .01,
.05 and .10. The appropriate approximations for
the distributions of $\delta_1$ and $\delta_2$ are Pearson Type
VI and Type I respectively. The empirical powers
of two standard score tests - median and Fisher,
are also studied for comparison purposes. The
results are presented in Tables 1-3, and some
typical graphs are also drawn. On interchanging
the roles of $N_1$ and $N_2$, the powers of the test
statistics remain more or less the same for
symmetric underlying populations. Hence we only
present the case of $N_1 \ge N_2$, for $N_1$ = 70, 60, 50, 40.

If we were to use only the first three moments
of $\delta_1$, $\delta_2$, the appropriate approximation to their
distribution is Pearson Type III, see Mielke,
Berry, Brockwell and Williams (1981). For
α = .001, the powers of $\delta_1$ under Pearson Type III
approximation are higher than those under Type VI
approximation for all shifts and sample sizes.
However, for α = .01, .05 and .10, the powers of
$\delta_1$ under Type VI approximation are slightly
higher than those under Type III approximation
for all sample sizes except $(N_1, N_2)$ = (70,10),
(10,70). Powers of $\delta_2$ under Type I approximation
are slightly lower than those under Type III
approximation for α = .01, .05 and .10, but the
situation is the other way round for α = .001.

For $N_1 = N_2$ and α = .01, both approximations
give the same value.

For normal underlying populations, powers of
$\delta_2$ are higher than those for $\delta_1$. For α = .001,
the power of the Fisher test is lower than that of
$\delta_2$ but generally higher than that of $\delta_1$. For
α = .01 and for lower shifts the power of the
Fisher test is lower, but for higher shifts and
for all other α's, the powers of the Fisher test are
higher.

For heavy-tailed distributions (Cauchy and
Laplace), the powers of $\delta_1$ are higher than those
of $\delta_2$. Powers of the Fisher test are lower than
those of $\delta_1$, $\delta_2$ and the median test.

In most cases the empirical levels of significance
for the median test are not within 2
standard deviation limits, and hence should not
be compared. However, where comparable, the
median test has slightly higher power than the
other tests, but for underlying normal populations
it has the least power.

Table 1. Empirical Powers when the Underlying Distribution is NORMAL
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2888  .3108  .2380  .3254  .4052  .4304  .2736  .4462  .4754  .4988  .3944  .5142  .5052  .5324  .4512  .5490 

4942  .5292  .3862  .5540  .6988  .7268  .5152  .7470  .7782  .8052  .6806  .8210  .8042  .8276  .7184  .8414 

6872  .7204  .5464  .7518  .8874  .9064  .7304  .9196  .9404  .9536  .8670  .9602  .9616  .9668  .9070  ,9684 

8502  .8792  .7230  .8980  .9756  .9796  .8804  .9850  -9924  ,9954  .9664  .9966  .9944  .9962  .9746  .9980 


Table  2.  Empirical  Powers  when  the  Underlying  Distribution  is  CAUCHY 




.3104  .3212  .2618  .5030  .4482  .4586  .3696  .5782  .5224  .6136  .4300  .6090  .5394  .6692  .4472

Table 3. Empirical Powers when the Underlying Distribution is LAPLACE

.3910  .3886  .3610  .3638  .5870  .5698  .5090  .5264  .6750  .6538  .6838  .6032  .7104  .6902  .7396  .6485

.6556  .6550  .5854  .6234  .8650  .8496  .7896  .8142  .9294  .9186  .9178  .8858  .9374  .9280  .9420  .8954

.8340  .8250  .7436  .8140  .9718  .9636  .9312  .9496  .9904  .9872  .9872  .9872  .9942  .9910  .9916  .9826

.9458  .9352  .8638  .9268  .9976  .9956  .9852  .9914  .9994  .9996  .9978  .9976  1.000  .9998  .9996  .9992


[Figures: Power of MRPP tests by underlying distribution. Panels:
Normal, N1=60, N2=20; Cauchy, N1=60, N2=20; Laplace, N1=40, N2=40;
Cauchy, N1=40, N2=40; Normal, N1=40, N2=40.]

Performance of Several One Sample Procedures
David L. Turner and YuYu Wang, Utah State University


Abstract

Empirical  p-values  and  powers  for  the 
usual  t  test,  the  signed  rank  test,  a 
trimmed  t  test,  a  Jackknife  and  a 
bootstrap  procedure  were  compared  using 
repeated  samples  of  size  30  from  normal, 
double  exponential,  Cauchy,  negative 
exponential  and  uniform  distributions 
for  normal  power  values  ranging  from 
0.05  through  0.95.  The  Bootstrap 
performed  as  well  as  the  usual  t  test. 
The  trimmed  t,  signed  rank  test  and  the 
usual  t-test  performed  about  the  same. 
The  Jackknife  performed  worst  among 
these  tests.  The  signed  rank  test  did 
best  for  the  Cauchy  distribution. 

1. Introduction.

The Jackknife and Bootstrap
procedures as described by Efron (1982)
seem to place an inordinate amount of
emphasis on the sample and how well it
approximates the true underlying
cumulative distribution function (cdf).
To investigate the performance of these
two techniques relative to some more
standard or usual tests, a Monte Carlo
study was run. Table 1 lists the 5
different test statistics and methods
used to compute p-values for each test.
Samples were generated from standard
normal, double exponential, negative
exponential, uniform and Cauchy
distributions. Each started with
uniform deviates generated by the
portable congruential random number
generator given by Wichmann and
Hill (1987). Each distribution was
scaled to have mean 0 and variance 1.
The Cauchy was scaled to have zero
median and the same interquartile range
as the standard normal.
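
The scaling described above can be sketched with inverse c.d.f. transforms of uniform deviates. This is an illustration under stated assumptions and not the authors' program: Python's default generator stands in for the Wichmann and Hill generator, the inverse transform is assumed as the method for every distribution, and the name `draw` is hypothetical.

```python
import math
import random
import statistics

# A Cauchy with scale s has interquartile range 2*s, so matching the
# standard normal IQR means s equals the normal upper quartile.
CAUCHY_SCALE = statistics.NormalDist().inv_cdf(0.75)


def open_unit(rng):
    """Uniform deviate strictly inside (0, 1)."""
    u = rng.random()
    while u == 0.0:
        u = rng.random()
    return u


def draw(dist, n, rng):
    """n deviates with mean 0 and variance 1 (median 0 for the Cauchy)."""
    out = []
    for _ in range(n):
        u = open_unit(rng)
        if dist == "normal":
            x = statistics.NormalDist().inv_cdf(u)
        elif dist == "double_exponential":
            b = 1.0 / math.sqrt(2.0)                 # variance 2*b^2 = 1
            x = b * math.log(2.0 * u) if u < 0.5 else -b * math.log(2.0 * (1.0 - u))
        elif dist == "negative_exponential":
            x = -math.log(1.0 - u) - 1.0             # Exp(1) shifted to mean 0
        elif dist == "uniform":
            x = math.sqrt(3.0) * (2.0 * u - 1.0)     # U(-sqrt(3), sqrt(3))
        elif dist == "cauchy":
            x = CAUCHY_SCALE * math.tan(math.pi * (u - 0.5))
        else:
            raise ValueError("unknown distribution: " + dist)
        out.append(x)
    return out


if __name__ == "__main__":
    rng = random.Random(0)
    for name in ("normal", "double_exponential", "negative_exponential", "uniform"):
        sample = draw(name, 30, rng)
        print(name, round(statistics.fmean(sample), 3))
```

The Cauchy has no mean or variance, which is why it is matched to the normal through its median and interquartile range instead.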

100 trials were run for each
combination of distribution and
sample size. After some preliminary
runs, 100 bootstrap samples were deemed
adequate for the demonstration purposes
of this paper.


2. p-value Analysis for $H_0: \mu = \mu_0$ true

The first analysis focused on the case
when the null hypothesis was true, i.e.
when $\mu$ was indeed equal to $\mu_0$.
Initially the t distribution was used to
calculate p-values for the Jackknife and
bootstrap procedures. This consistently
gave average p-values slightly larger
than 0.5. A more "nonparametric"
p-value was then implemented for these 2
procedures which defined p as
$2[\min(\#\bar{Y}\text{'s} < \mu_0,\ \#\bar{Y}\text{'s} > \mu_0)]$.

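A counting p-value of this kind can be sketched as follows. The paper states the rule in terms of counts; dividing by the number of resampled means so that the result is a proportion is an assumption made here, and the function names are hypothetical.

```python
import random


def two_sided_count_p(estimates, mu0):
    """2 * min(#estimates < mu0, #estimates > mu0), as a proportion."""
    below = sum(1 for e in estimates if e < mu0)
    above = sum(1 for e in estimates if e > mu0)
    return 2.0 * min(below, above) / len(estimates)


def jackknife_means(data):
    """Means computed by deleting each observation in turn."""
    total, n = sum(data), len(data)
    return [(total - y) / (n - 1) for y in data]


def bootstrap_means(data, n_boot=100, seed=0):
    """Means of random samples drawn with replacement from the data."""
    rng = random.Random(seed)
    n = len(data)
    return [sum(rng.choices(data, k=n)) / n for _ in range(n_boot)]


if __name__ == "__main__":
    data = [-2.0, -1.0, 0.0, 1.0, 2.0]
    print(two_sided_count_p(jackknife_means(data), 0.0))
    print(two_sided_count_p(bootstrap_means(data), 0.0))
```

Because the jackknifed means vary far less than bootstrap means, they fall on one side of the hypothesised value much more readily, which gives small p-values under this rule.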

Since the null hypothesis was in fact
true, the p-values should have followed
a uniform distribution. Figure 1 plots
the empirical cdf of the p-values
against the cdf for a uniform
distribution for samples from each of
the distributions when n = 30. A 45°
line for these plots was regarded as the
standard, indicating no departure from
the underlying model.

p-values for the Cauchy distribution
tend to clump or cluster in the middle a
little more than they should for the t
and the trimmed t. The negative
exponential also has 'light tails' for
the signed rank test. The bootstrap
comes surprisingly close to the optimal
45° line, indicating that the p-values
for a true null hypothesis are close to
uniformly distributed. The p-values for
the Jackknife show a completely different
story however. It appears to be very
difficult to get a Jackknifed $\bar{Y}$ more
extreme than the original, as indicated
by the fact that far more small p-values
were observed than expected.

3. p-value Analysis for $H_0: \mu = \mu_0$ false.

The next step in the analysis
involved specifying values for the
true mean $\mu_{true}$
to make the power of the t-test take on
the values 0.05, .10, .25, .50, .75, .90
and .95 for n = 30. Figures 2 and 3
plot the average p-values for 100
repetitions of each $\mu_{true}$ value for each
distribution. Figure 2 shows how poorly
the Jackknife does across the 5
distributions considered here. It is
surprising to see how little fluctuation
there is among the tests for samples
from all but the Cauchy distribution.
Each of the tests except the Jackknife
seems to be fairly robust to departures
from normality. The signed rank test is
the clear winner for the Cauchy
distribution, having a curve more like
those for sampling from the normal
distribution. Figure 3 plots the same
values but with a different arrangement.
Each test seems to do poorly for the
Cauchy distribution except the signed
rank test.

4. Empirical power analysis

The null hypothesis for this study
was rejected for any particular run if
the empirical p-value was less than α =
0.05. Figures 4 and 5 plot the
empirical power values for the 5 tests
against the 5 distributions plotted
against the 7 different $\mu$ values.
Figure 4 shows very close agreement
among all tests but the Jackknife for
all but the Cauchy distribution. For
the Cauchy, the signed rank test comes
closest to providing an 'ideal' power
curve. The bootstrap shows a slight
improvement over the t test, but does
not do as well as the trimmed t. The
jackknife almost always rejects, and
seems to differ but little as the value
of $\mu$ changes. Figure 5 shows that
all but the signed rank test provide
very low power for the Cauchy
distribution.

5. Summary and Conclusion

For the 5 distributions considered
here, the bootstrap did surprisingly
well as did the signed rank test.
Equally surprising or comforting was how
well the t test did. For these 5
distributions, the usual t appears to be
robust enough. The present formulation
of the Jackknife is not useful, being
far too liberal and rejecting far more
often than it should.

Further research in this area could
include extreme value distributions and
perhaps a mixture distribution such as
0.9N(0,1) + 0.1N(0,100), or more "L"
shaped distributions. More complex
procedures could also be tried.

Table 1. One sample procedures compared.

Procedure       Test Statistic                         p-value calculation

t statistic     $(\bar{Y} - \mu_0)/(S/\sqrt{n})$              t distribution

Trimmed t       t for trimmed data                     t distribution
                (deleting lower 5%
                and upper 5%)

Signed Rank t   t for signed ranks of                  t distribution
                $(Y_i - \mu_0)$

Jackknife       $\bar{Y}_i$'s computed by                  $2[\min(\#\bar{Y}\text{'s} < \mu_0,\ \#\bar{Y}\text{'s} > \mu_0)]$
                deleting each $Y_i$

Bootstrap       $\bar{Y}$'s computed from                  $2[\min(\#\bar{Y}\text{'s} < \mu_0,\ \#\bar{Y}\text{'s} > \mu_0)]$
                random samples with
                replacement from the
                data

Figure 1. Plots of the empirical cdf of the p-values against the
uniform cdf for each test and distribution for n = 30.

Figure 2. Plots of the average p-values for n = 30 for each
distribution and test combination for false $H_0$'s.

XV. TIME SERIES ANALYSIS


Computational Aspects of Harmonic Signal Detection
Keh-Shin Lii, Tai-Houn Tsou, University of California, Riverside

Time Series in a Microcomputer Environment
John D. Henstridge, Perth, Western Australia

Moving Window Detection for 0-1 Markov Trials
Joseph Glaz, Philip C. Hormel, Bruce McK. Johnson, University of
Connecticut and CIBA-GEIGY Corporation

Inference Techniques for a Class of Exponential Time Series

Alternative Methods for Computing the Theoretical Auto Covariance Function of
Multivariate ARMA Processes: A Comparison
Stefan Mittnik, SUNY at Stony Brook

COMPUTATIONAL  ASPECT  OF  HARMONIC  SIGNAL  DETECTION 


Keh-Shin Lii, University of California, Riverside and Tai-Houn Tsou, University of California, Riverside


1. INTRODUCTION

We consider a model of the form

$$X_t = Y_t + Z_t \qquad (1)$$

with $Y_t$ a periodic function given by

$$Y_t = \sum_{k=1}^{K} R_k \cos(\omega_k t + \phi_k) \qquad (2)$$

where $R_k$, $\omega_k$, and $\phi_k$ are the amplitude, frequency
and phase of the harmonic process $Y_t$, and $Z_t$ is an
additive noise process which is independent of $Y_t$.
When $Z_t$ is white, Schuster (1898), Fisher (1929),
Whittle (1952), and Siegel (1980) discussed how to
detect the harmonic signal $Y_t$. In many applications
in engineering, meteorology and ecology, the
background noise may not be white, or it
can be represented as a linear process, such as

$$Z_t = \sum_{u} \alpha_u \epsilon_{t-u} \qquad (3)$$

where the $\epsilon_t$'s are independent, identically
distributed and the $\alpha_u$'s are constants. Usually we assume
that $\epsilon_t$ has mean zero with variance $\sigma_\epsilon^2$. Whittle
(1954), Bartlett (1957), and Priestley (1962 a,b)
dealt with the testing and estimation problems,
when the noise process is assumed to be colored.

To motivate our procedure, we first consider
Fisher's test. This is the case when the
noise process $Z_t$ is assumed to be zero mean
Gaussian white noise with variance $\sigma^2$. The null
hypothesis can be stated as

$H_0$: the harmonic signal $Y_t$ is zero in (1).

Under $H_0$, the periodogram of the process $X_t$,

$$I_N^X(\lambda_j) = (2/N) \Big| \sum_{t=1}^{N} X_t \exp(-it\lambda_j) \Big|^2 \quad \text{with } \lambda_j = 2\pi j/N,$$

has a Chi-square distribution with 2
degrees of freedom, if it is divided by $\sigma^2$.
Furthermore, $I_N^X(\lambda_j)$ and $I_N^X(\lambda_k)$ are independent
if $j \ne k$, with $\lambda_m = 2\pi m/N$ for j, k = 1, 2, ..., [N/2].

This result also holds asymptotically when the
noise process is independent, identically
distributed, but not necessarily Gaussian
[Brillinger (1975) p.94]. Based upon the previous
result, Fisher (1929) derived the exact distribution
for the test of the largest peak of the
periodogram, i.e.

$$G^{(f)} = \frac{\max_{1 \le j \le [N/2]} I_N^X(\lambda_j)}{\sum_{1 \le j \le [N/2]} I_N^X(\lambda_j)} \qquad (4)$$

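
The periodogram and the ratio in (4) can be sketched directly from their definitions. The code below is a minimal illustration with hypothetical function names and a direct O(N^2) Fourier sum rather than an FFT. The tail probability uses the standard closed form P(G > g) = sum over j from 1 to floor(1/g) of (-1)^(j-1) C(m, j) (1 - jg)^(m-1) for m independent chi-square(2) ordinates; that formula is not quoted in the text above, and it is exact only when the ordinate at frequency pi (present when N is even) is excluded, so its use with all [N/2] ordinates is an approximation assumed here.

```python
import cmath
import math


def periodogram(x):
    """I(lambda_j) = (2/N) |sum_t x_t exp(-i t lambda_j)|^2 for j = 1..[N/2]."""
    n = len(x)
    ordinates = []
    for j in range(1, n // 2 + 1):
        lam = 2.0 * math.pi * j / n
        s = sum(x[t - 1] * cmath.exp(-1j * t * lam) for t in range(1, n + 1))
        ordinates.append(2.0 / n * abs(s) ** 2)
    return ordinates


def fisher_g(ordinates):
    """Largest periodogram ordinate divided by the sum of the ordinates."""
    return max(ordinates) / sum(ordinates)


def fisher_g_tail(g, m):
    """P(G > g) under Gaussian white noise for m chi-square(2) ordinates."""
    top = min(m, int(math.floor(1.0 / g)))
    return sum((-1) ** (j - 1) * math.comb(m, j) * (1.0 - j * g) ** (m - 1)
               for j in range(1, top + 1))


if __name__ == "__main__":
    n = 64
    signal = [math.cos(2.0 * math.pi * 5 * t / n) for t in range(1, n + 1)]
    ordinates = periodogram(signal)
    g = fisher_g(ordinates)
    print(ordinates.index(max(ordinates)) + 1, g)
```

For a pure sinusoid at a Fourier frequency nearly all of the power falls in one ordinate, so the ratio is close to its maximum value of 1.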

When the noise process is linear, with the
form given in (3) and the conditions that
$E|\epsilon_t|^4 < \infty$ and $\sum_u |\alpha_u||u| < \infty$, then the power
spectrum of $Z_t$ is

$$f^Z(\lambda) = (\sigma_\epsilon^2 / 2\pi) \Big| \sum_u \alpha_u \exp(-iu\lambda) \Big|^2$$

and its periodogram has the following relationship
with the periodogram of $\epsilon_t$:

$$I_N^Z(\lambda) = \Big| \sum_u \alpha_u \exp(-iu\lambda) \Big|^2 I_N^\epsilon(\lambda) + R_N(\lambda)$$

where $R_N(\lambda) = O(1/N)$ uniformly in $\lambda$ [Priestley
(1981) p.424]. From this, it is clear that
asymptotically

$$\frac{I_N^Z(\lambda)}{\big| \sum_u \alpha_u \exp(-iu\lambda) \big|^2} \approx I_N^\epsilon(\lambda) \qquad (5)$$

if $|\sum_u \alpha_u \exp(-iu\lambda)| \ne 0$ for all $\lambda$. The asymptotic
distribution of $I_N^\epsilon(\lambda)$ is known from the previous
discussion. This observation motivated various
methods which attempt to estimate the spectrum of
the noise process $Z_t$, which is proportional to
$|\sum_u \alpha_u \exp(-iu\lambda)|^2$ under the null hypothesis $H_0$.

We will first review some conventional methods
which include Whittle's test, Bartlett's test and
Priestley's $P(\lambda)$ test.

(I) Whittle's test - Whittle (1952, 1954)

The basic idea of this approach is to use the
asymptotic relationship between the periodogram of
the general linear process $\{Z_t\}$ and that of the
residual process $\{\epsilon_t\}$. Following Fisher's test
in equation (4), Whittle proposed the test
statistic

$$G^{(w)} = \frac{\max_j \left( I_N^X(\lambda_j) / 2\pi f^Z(\lambda_j) \right)}{\sum_j \left( I_N^X(\lambda_j) / 2\pi f^Z(\lambda_j) \right)} \qquad (6)$$

where j = 1, ..., n, n = [N/2].

Under $H_0$, $G^{(w)}$ is asymptotically distributed as
$G^{(f)}$. The problem is that the actual spectral density
function $f^Z(\lambda)$ of $Z_t$ is usually unknown. The
remedy for this is to use the estimated power
spectrum of $\{Z_t\}$ as a substitute for $f^Z(\lambda_j)$.
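
Given periodogram ordinates and a spectral density estimate at the same frequencies, Whittle's ratio is a short computation. The sketch below assumes both are supplied as equal-length lists; how the spectral estimate is obtained is left open, and the function name is hypothetical.

```python
import math


def whittle_statistic(ordinates, spectrum):
    """Fisher-type ratio after dividing each ordinate by 2*pi*f(lambda_j).

    ordinates: periodogram values at the Fourier frequencies.
    spectrum: estimated noise spectral density at the same frequencies.
    """
    if len(ordinates) != len(spectrum):
        raise ValueError("ordinates and spectrum must have the same length")
    scaled = [i / (2.0 * math.pi * f) for i, f in zip(ordinates, spectrum)]
    return max(scaled) / sum(scaled)


if __name__ == "__main__":
    # With a flat spectrum the statistic reduces to Fisher's ratio.
    print(whittle_statistic([1.0, 2.0, 7.0], [3.0, 3.0, 3.0]))
    print(whittle_statistic([2.0, 4.0, 6.0], [1.0, 2.0, 1.0]))
```

The constant 2*pi cancels in the ratio; it is kept only to mirror the form of the statistic.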

(II) Grouped periodogram test - Bartlett (1957)

This method divides the periodogram ordinates
into r = [N/2k] sets, each set containing k
ordinates. When k is relatively small compared with
N, the spectral density function in this region is
almost flat. Thus, on the frequency domain, if we
choose the bandwidth of the smoothing kernel small
enough, the estimated power spectrum of $X_t$ will
be almost constant for these k ordinates, except
the harmonic terms in the frequency $\omega$. Let

$$G_r^{(B)} = \frac{\max_{(r-1)k+1 \le j \le rk} I_N^X(\lambda_j)}{\sum_{j=(r-1)k+1}^{rk} I_N^X(\lambda_j)} \qquad (7)$$

Under $H_0$, $G_r^{(B)}$ has approximately the same
distribution as Fisher's test with k degrees of freedom.
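
The grouping can be sketched as follows, assuming the periodogram ordinates are already available as a list and that any ordinates left over after forming complete groups of k are ignored. The function name is hypothetical.

```python
def grouped_periodogram_statistics(ordinates, k):
    """Fisher-type ratio within each consecutive block of k ordinates."""
    n_groups = len(ordinates) // k          # incomplete trailing block is dropped
    stats = []
    for r in range(n_groups):
        block = ordinates[r * k:(r + 1) * k]
        stats.append(max(block) / sum(block))
    return stats


if __name__ == "__main__":
    ordinates = [1.0, 1.0, 1.0, 5.0, 2.0, 2.0, 2.0, 2.0]
    print(grouped_periodogram_statistics(ordinates, 4))
```

Each block is compared only with its neighbours in frequency, so a slowly varying noise spectrum has little effect on the ratios.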
(III) $P(\lambda)$ test - Priestley (1962 a,b)

The idea behind this test is to use properties
of the autocovariance function (ACF) of $Y_t$ and $Z_t$.
It is assumed that the ACF of the general linear
process will die out as lag $u \to \infty$ and the ACF
of the harmonic process $Y_t$ will persist even for
large u. Let

$$P(\lambda) = (1/2\pi) \sum_{u=-N+1}^{N-1} \left( K_n^{(1)}(u) - K_m^{(2)}(u) \right) C(u) \exp(-i\lambda u) = f_n^X(\lambda) - f_m^X(\lambda),$$

where C(u) is the sample autocovariance function
and $K_n^{(1)}(u)$, $K_m^{(2)}(u)$ are two symmetric sequences
of weight functions such that both decrease as |u|
increases, and $m/n \to 0$, $n/N \to 0$, as $m \to \infty$, $n \to \infty$
and $N \to \infty$.

To detect the harmonic process, we plot $P(\lambda)$
vs. $\lambda$ and test the significance of the large
peaks. The standardized cumulative sums can be
defined as

$$J_q = \frac{\sqrt{N/(m\Lambda_{n,m})}\ \sum_{i=1}^{q} P(2\pi i/m)}{\{(1/2\pi)\, G(\pi)\}^{1/2}}, \qquad (8)$$

where q = 0, 1, ..., [m/2], and $G(\pi)$ is estimated by
$\hat{G}(\pi) = (1/4\pi) \sum_{u=-m}^{m} C^2(u)$. Details of this
discussion can be found in Priestley (1981).

All these methods are based on the second order moments of the time series. When the noise process is Gaussian, moments up to second order give all the information. If the noise process is non-Gaussian, then cumulants of order greater than two might provide extra information in addition to that given by the first and second order moments. In the next section we present a method which takes advantage of third and fourth order cumulants to improve the efficiency of detecting the existence of the periodic function $Y_t$, under the assumption that the noise process is non-Gaussian.

2. TEST STATISTICS

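For simplicity, we will study the processes $Y_t$ and $Z_t$ separately. Assume that $X_t$ is stationary up to order eight and that all cumulants are summable up to the eighth order. The bispectrum and the trispectrum are the Fourier transforms of the third and fourth order cumulant functions, i.e.

$$f^x(\lambda_1,\lambda_2) = (2\pi)^{-2}\sum_{u,v=-\infty}^{\infty} C^x(u,v)\exp(-iu\lambda_1 - iv\lambda_2) = f^y(\lambda_1,\lambda_2) + f^z(\lambda_1,\lambda_2),$$

$$f^x(\lambda_1,\lambda_2,\lambda_3) = (2\pi)^{-3}\sum_{u,v,s=-\infty}^{\infty} C^x(u,v,s)\exp(-iu\lambda_1 - iv\lambda_2 - is\lambda_3) = f^y(\lambda_1,\lambda_2,\lambda_3) + f^z(\lambda_1,\lambda_2,\lambda_3). \qquad (9)$$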

Assume that the harmonic process contains only one harmonic component, i.e.

$$Y_t = R\cos(\omega t + \phi), \qquad \phi \sim U(-\pi, \pi).$$

Let $\Delta_N(\theta) = \sin((N + 1/2)\theta) / (2\pi\sin(\theta/2))$ be the Dirichlet kernel. Then, on the particular submanifolds $(\lambda,0)$ and $(\lambda,0,0)$, we have


1^(X,0)  =  0(1/’J) 

iy(x,o,o)  =  (jRVieliSjjCu-x) 

+  «jj(<v+X)J^^^(a)).  (10) 

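Now consider the noise process $\{Z_t\}$. When $\{Z_t\}$ is a non-Gaussian linear process for which the 3rd (4th) order cumulant $\gamma_3$ ($\gamma_4$) of $\epsilon_t$ exists, the bispectrum and the trispectrum can be represented as

$$f^z(\lambda_1,\lambda_2) = \frac{\gamma_3}{(2\pi)^2}\,\Gamma(\lambda_1)\Gamma(\lambda_2)\Gamma^*(\lambda_1+\lambda_2),$$

$$f^z(\lambda_1,\lambda_2,\lambda_3) = \frac{\gamma_4}{(2\pi)^3}\,\Gamma(\lambda_1)\Gamma(\lambda_2)\Gamma(\lambda_3)\Gamma^*(\lambda_1+\lambda_2+\lambda_3),$$

with $\Gamma(\lambda) = \sum_u a_u \exp(-iu\lambda)$.

When the noise process is given as in (3) and the third order cumulant of $\epsilon_t$ is nonzero, the bispectrum of the process $X_t$ on the submanifold $(\lambda_1,\lambda_2) = (\lambda,0)$ can be represented as

$$f^x(\lambda,0) = f^z(\lambda,0) = (2\pi)^{-2}\,\gamma_3\,\Gamma(0)\,|\Gamma(\lambda)|^2.$$

Hence, when $\Gamma(0) \neq 0$,

$$|\Gamma(\lambda)|^2 = D_1 f^x(\lambda,0)$$

with $D_1 = (2\pi)^2/(\gamma_3\Gamma(0))$. From these discussions, the following test statistic, from (5), is proposed:

$$G^{(b)} = \frac{\max_{1 \le j \le [N/2]} \left\{ I^x(\lambda_j)/Rf_N^x(\lambda_j,0) \right\}}{\sum_{1 \le j \le [N/2]} \left\{ I^x(\lambda_j)/Rf_N^x(\lambda_j,0) \right\}}, \qquad j = 1, 2, \ldots, [N/2], \qquad (11)$$

where $Rf_N^x(\lambda,0)$ is a consistent estimator of the real part of the bispectrum $f^x(\lambda_1,\lambda_2)$ at frequency $(\lambda,0)$, and the unknown constant $D_1$ cancels out.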

When $\epsilon_t$ is symmetrically distributed, or $\gamma_3 = 0$, with nonzero 4th order cumulant $\gamma_4$, equation (9) implies

$$D_2 f^x(\lambda,0,0) = D_2 f^y(\lambda,0,0) + |\Gamma(\lambda)|^2,$$

where $D_2 = (2\pi)^3/(\gamma_4\Gamma^2(0))$. According to (10), the bias around the harmonic frequency using $D_2 f_N^x(\lambda,0,0)$ is generally smaller than that using $(2\pi/\sigma_\epsilon^2) f_N^x(\lambda)$, due to smoothing. Thus, the trispectrum of $X_t$ provides an estimate of $|\Gamma(\lambda)|^2$ which has smaller bias than the methods using the power spectrum. The following statistic is used to detect the harmonic component:

$$G^{(t)} = \frac{\max_{1 \le j \le [N/2]} \left\{ I^x(\lambda_j)/Rf_N^x(\lambda_j,0,0) \right\}}{\sum_{1 \le j \le [N/2]} \left\{ I^x(\lambda_j)/Rf_N^x(\lambda_j,0,0) \right\}}, \qquad j = 1, 2, \ldots, [N/2], \qquad (12)$$

where $Rf_N^x(\lambda_j,0,0)$ is the estimated real part of the trispectrum at $(\lambda_j,0,0)$.

3. ESTIMATION AND COMPUTATION OF NEW STATISTICS

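Since the bispectrum and trispectrum are defined as in (9), their natural estimates are the third and fourth order periodograms

$$I_N^x(\lambda_1,\lambda_2) = (2\pi)^{-2}\sum_{u,v=-N+1}^{N-1} C^x(u,v)\exp(-i\lambda_1 u - i\lambda_2 v),$$

$$I_N^x(\lambda_1,\lambda_2,\lambda_3) = (2\pi)^{-3}\sum_{u,v,s=-N+1}^{N-1} C^x(u,v,s)\exp(-i\lambda_1 u - i\lambda_2 v - i\lambda_3 s),$$

and the weighted estimates

$$f_N^x(\lambda_1,\lambda_2) = (2\pi)^{-2}\sum_{u,v=-N+1}^{N-1} K_{M_1}(u,v)\,C^x(u,v)\exp(-iu\lambda_1 - iv\lambda_2), \qquad (13)$$

$$f_N^x(\lambda_1,\lambda_2,\lambda_3) = (2\pi)^{-3}\sum_{u,v,s=-N+1}^{N-1} K_{M_2}(u,v,s)\,C^x(u,v,s)\exp(-iu\lambda_1 - iv\lambda_2 - is\lambda_3), \qquad (14)$$

respectively, where

$$C^x(u,v) = m^x(u,v),$$

$$m^x(u,v) = \frac{1}{T_1}\sum_{t=-\min(u,v,0)+1}^{N-\max(u,v,0)} X_t X_{t+u} X_{t+v},$$

with $-N+1 \le u,v \le N-1$, $T_1 = N - \max(u,v,0) + \min(u,v,0)$, and

$$C^x(u,v,s) = m^x(u,v,s) - m^x(u)\,m^x(v-s) - m^x(v)\,m^x(u-s) - m^x(s)\,m^x(u-v),$$

$$m^x(k-h) = \frac{1}{T_2}\sum_{t=-\min(u,v,s,0)+1}^{N-\max(u,v,s,0)} X_{t+h} X_{t+k},$$

$$m^x(u,v,s) = \frac{1}{T_2}\sum_{t=-\min(u,v,s,0)+1}^{N-\max(u,v,s,0)} X_t X_{t+u} X_{t+v} X_{t+s},$$

with $-N+1 \le u,v,s \le N-1$, where $T_2 = N - \max(u,v,s,0) + \min(u,v,s,0)$ and $(h,k)$ is any two-element partition of the set $(0,u,v,s)$.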
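The summation limits for the third order moment estimate are easy to get wrong, so a minimal Python sketch may help. It assumes NumPy, zero-based indexing (so the paper's range $t = -\min(u,v,0)+1, \ldots, N-\max(u,v,0)$ becomes `lo .. hi-1`), and a series that has already been mean-corrected; the function name `third_moment` is hypothetical.

```python
import numpy as np

def third_moment(x, u, v):
    """m(u, v) = (1/T1) * sum_t x[t] * x[t+u] * x[t+v] over all valid t."""
    x = np.asarray(x, dtype=float)
    n = x.size
    lo = -min(u, v, 0)        # first index with t+u >= 0 and t+v >= 0
    hi = n - max(u, v, 0)     # one past the last index with t+u, t+v < n
    if hi <= lo:
        return 0.0            # no overlapping terms for these lags
    t = np.arange(lo, hi)     # exactly T1 = N - max(u,v,0) + min(u,v,0) terms
    return float(np.mean(x[t] * x[t + u] * x[t + v]))
```

For example, with `x = [1, 2, 3, 4]` and lags `(1, 0)` the three products are 2, 12 and 36, so the estimate is 50/3.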

Noting that the third and fourth order periodograms are not consistent estimators of the bispectrum and trispectrum, Rosenblatt and Van Ness (1965) and Brillinger and Rosenblatt (1967) mentioned two different approaches to estimating the bispectrum and trispectrum consistently. One is to Fourier transform the smoothed 3rd and 4th order cumulant functions; the other is to smooth the 3rd and 4th order periodogram functions.

The advantages of the latter approach are reduced computational time and savings in computer storage. Once the Fourier transform of the random process is given, the rest of the calculation of the 3rd and 4th order periodograms is just multiplication at different frequencies. The disadvantage is that it fails to provide the bispectrum and trispectrum on the submanifolds. For the submanifolds, Brillinger and Rosenblatt (1967) suggested averaging values in a neighborhood of a submanifold to approximate the exact calculation. In contrast, the former approach requires more computer memory and longer computational time to calculate the 3rd and 4th order periodogram, but it provides direct estimates of the bispectrum and trispectrum for all frequencies, including those on the submanifolds.

Since we are only interested in the bispectrum and trispectrum on certain submanifolds, we focus on the time domain smoothing method only.

The estimates of the bispectrum and trispectrum can be obtained from equations (13) and (14), where $K_{M_1}(u,v) = K(u/M_1, v/M_1)$ and $K_{M_2}(u,v,s) = K(u/M_2, v/M_2, s/M_2)$ are 2- and 3-dimensional lag windows with sequences of constants $\{M_1\}$ and $\{M_2\}$ which tend to infinity as $N \to \infty$, with $M_1^2/N \to 0$ and $M_2^3/N \to 0$. An easy way to create the 2- and 3-dimensional lag windows is to take the product of one-dimensional lag windows. Based on equation (13) and some assumptions, Rosenblatt and Van Ness (1965) derived the mean and variance of the bispectrum estimate, which are given as Theorem 4 and Theorem 5 in their paper. More general results including the trispectrum are given by Lii and Rosenblatt (1988). From these discussions and equations (13) and (14), we have consistent estimates of the power spectrum, up to a constant, from the estimate of the bispectrum $f_N^x(\lambda,0)$ and the estimate of the trispectrum $f_N^x(\lambda,0,0)$. These consistent estimates are used in the test statistics $G^{(b)}$ and $G^{(t)}$ given in equations (11) and (12).
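A lag-window estimate of the bispectrum on the submanifold $(\lambda, 0)$ can be sketched as follows. This is an illustration under assumptions, not the authors' program: it uses NumPy, a product of one-dimensional Bartlett windows truncated at lag M, and mean correction of the data; the names `bartlett` and `bispectrum_on_submanifold` are hypothetical.

```python
import numpy as np

def third_cumulant(x, u, v):
    """Third order moment of a mean-corrected series at lags (u, v)."""
    n = x.size
    lo, hi = -min(u, v, 0), n - max(u, v, 0)
    t = np.arange(lo, hi)
    return np.mean(x[t] * x[t + u] * x[t + v])

def bartlett(z):
    """One-dimensional Bartlett (triangular) lag window."""
    return np.maximum(1.0 - np.abs(z), 0.0)

def bispectrum_on_submanifold(x, M, lams):
    """Estimate f(lambda, 0) with a product Bartlett window (requires M < N)."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()                          # assumption: mean-correct first
    lags = np.arange(-M, M + 1)
    w = bartlett(lags / M)
    # a[u] = w(u) * sum_v w(v) C(u, v); the v-sum does not depend on lambda
    a = np.array([w[i] * sum(w[j] * third_cumulant(x, u, v)
                             for j, v in enumerate(lags))
                  for i, u in enumerate(lags)])
    lams = np.atleast_1d(np.asarray(lams, dtype=float))
    phases = np.exp(-1j * np.outer(lams, lags))
    return phases @ a / (2.0 * np.pi) ** 2
```

The real part of the returned array is the quantity that enters the denominator of a statistic of the form (11). Because only the submanifold is needed, the sum over the second lag is done once rather than for every frequency.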
4. SIMULATION RESULT

To demonstrate the effectiveness of the $G^{(b)}$ and $G^{(t)}$ tests in equations (11) and (12), we now study two simulated series which have mixed spectra. Consider the simulated series from equation (1) with k = 1 for the harmonic process, where the harmonic process and the linear process are defined as follows:

$$Y_t = R\cos(\omega t),$$

$$Z_t + 1.2 Z_{t-1} + 0.6 Z_{t-2} = \epsilon_t,$$

where the coefficients of the AR(2) process $Z_t$ are $\phi_1 = 1.2$, $\phi_2 = 0.6$, and the $\epsilon_t$ are independent exponentially distributed random deviates with mean one, generated by the GGEXN subroutine in IMSL. The number of observations generated for each series is N = 256. Different values of $R$ and $\omega$ are used to compare the power of the methods under different conditions. We choose $R = 0.5$, $\omega/2\pi = 1/4$ in series 1 and $R = 1.0$, $\omega/2\pi = 27/64$ in series 2. The process $X_t$ and its periodogram are shown in Figure 1 and Figure 2.
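The two test series can be regenerated approximately with the sketch below. Several things here are assumptions rather than details from the paper: NumPy's generator stands in for the IMSL GGEXN routine, the AR(2) recursion is read as $Z_t = -1.2Z_{t-1} - 0.6Z_{t-2} + \epsilon_t$, a burn-in period is discarded, and the name `simulate_series` is hypothetical.

```python
import numpy as np

def simulate_series(R, omega, n=256, burn_in=500, seed=0):
    """X_t = R cos(omega t) + Z_t with AR(2) noise driven by Exp(1) innovations."""
    rng = np.random.default_rng(seed)
    eps = rng.exponential(scale=1.0, size=n + burn_in)
    z = np.zeros(n + burn_in)
    for t in range(2, n + burn_in):
        # Z_t + 1.2 Z_{t-1} + 0.6 Z_{t-2} = eps_t
        z[t] = -1.2 * z[t - 1] - 0.6 * z[t - 2] + eps[t]
    z = z[burn_in:]                      # drop the start-up transient
    t = np.arange(1, n + 1)
    return R * np.cos(omega * t) + z

series1 = simulate_series(R=0.5, omega=2 * np.pi * 1 / 4)
series2 = simulate_series(R=1.0, omega=2 * np.pi * 27 / 64)
```

With these coefficients the noise spectrum peaks near 2.46 radians, so the harmonic at $2\pi \cdot 27/64 \approx 2.65$ in series 2 sits close to the noise peak, while the harmonic at $\pi/2$ in series 1 is well separated from it.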

Results for the different testing methods are presented in the following.

(i) Whittle's test

Here we select the Bartlett window to be the smoothing function of $Z_t$, with truncation parameter M = 25 based on the autocovariance function. Since

$$P(G^{(w)} > z) = 1 - (1 - \exp(-nz))^n, \qquad n = [N/2] - 1,$$

if we choose the significance levels $\alpha$ = .1, .05, and .01, then from $z_\alpha = -\ln(1 - (1-\alpha)^{1/n})/n$ we have $z_{.1} = .0559$, $z_{.05} = .0615$, and $z_{.01} = .0743$ respectively. Figure 3 presents the plot of $I^x(\lambda_j)/f_N^z(\lambda_j)$ vs. frequency, which shows a number of suspected large peaks. These peaks are used to test the existence of harmonic components.
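The critical values above follow directly from inverting the tail probability, and can be checked with a short Python calculation. The function name `whittle_critical_value` is hypothetical; n = 127 corresponds to N = 256.

```python
import math

def whittle_critical_value(alpha, n):
    """Solve 1 - (1 - exp(-n z))**n = alpha for z."""
    return -math.log(1.0 - (1.0 - alpha) ** (1.0 / n)) / n

n = 256 // 2 - 1                      # n = [N/2] - 1 with N = 256
critical = {a: whittle_critical_value(a, n) for a in (0.1, 0.05, 0.01)}
```

The computed values agree with the quoted .0559, .0615 and .0743 to about three decimal places.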

The final result is presented in Table 1, where *, ** and *** indicate that Whittle's statistic is significant at levels .1, .05, and .01 respectively. From Table 1, we find that Whittle's test has difficulty in detecting the harmonic components when the mass spectrum of the harmonic signal $f^y(\omega_j)$ is mixed with the large spectrum of the noise $f^z(\lambda_j)$, as in series 2.


(ii) Bartlett's test

Since the grouping parameter k is selected arbitrarily, for comparison purposes we use the test with (a) k = 4, (b) k = 8, (c) k = 12. The critical value $z_\alpha$ for $G^{(g)}$ can be calculated by approximating Fisher's test with k degrees of freedom; thus we have $z_\alpha = 1 - (\alpha/n)^{1/(k-1)}$. Table 2 shows the $z_\alpha$ values for the different grouping parameters. Table 3 summarizes the results, in which one can see that Bartlett's test behaves similarly to Whittle's test. It detects the periodicity in series 1 but fails to detect the harmonic component in series 2.
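The entries of Table 2 can be reproduced from the critical value formula with n = [N/2] - 1 = 127. The sketch below is only a check of the arithmetic; the function name `grouped_critical_value` is hypothetical.

```python
def grouped_critical_value(alpha, n, k):
    """Critical value 1 - (alpha / n)**(1 / (k - 1)) for groups of k ordinates."""
    return 1.0 - (alpha / n) ** (1.0 / (k - 1))

n = 127
table2 = {
    (alpha, k): round(grouped_critical_value(alpha, n, k), 4)
    for alpha in (0.1, 0.05, 0.01)
    for k in (4, 8, 12)
}
```

Comparing the statistics in Table 3 with these values shows why all three groupings reject for series 1 and none reject for series 2.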

(iii) P(λ) test

This is a double window smoothing method, in which we choose a Bartlett window with n = 128 and a truncated periodogram window with m = 25. The P(λ) test statistic is given in equation (8), and

$$\Lambda_{n,m} = 2n/3 - 2m + 2m^2/n = 45.1.$$

Figure 4 plots $P^*(\lambda)$ vs. frequency, where $P^*(\lambda) = P(\lambda)/C^*(0)$. Since $J_q$ is a cumulative function with an asymptotically normal distribution, the significance test can be constructed by plotting $J_q$ vs. $q$ and determining whether $J_q$ crosses the boundary $z_\alpha$, where $z_\alpha$ can be derived from the usual two-sided percentage points of a standard normal. Thus, if $\alpha$ = 0.1, 0.05 and 0.01, we get $z_{.1} = 1.645$, $z_{.05} = 1.96$, and $z_{.01} = 2.58$. The results are summarized in Table 4, which shows that the P(λ) test, compared with Whittle's test, is more powerful when the mass spectrum $f^y(\omega_j)$ is mixed with peaks of the spectral density function $f^z(\lambda_j)$, but is not as reliable when the harmonic component is separated from the peaks of the spectral density function of the noise.
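As a quick check of the constant quoted for this test, assuming n = 128 for the Bartlett window and m = 25 for the truncated periodogram window:

$$\Lambda_{n,m} = \frac{2n}{3} - 2m + \frac{2m^2}{n} = \frac{256}{3} - 50 + \frac{1250}{128} \approx 85.33 - 50 + 9.77 = 45.10.$$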


(iv) $G^{(b)}$ test

To use the $G^{(b)}$ test, we first need to estimate the real part of the bispectrum $f^x(\lambda_j,0)$ at the frequency $(\lambda_j,0)$ by properly selecting the truncation lag and the two-dimensional smoothing window. Here we select the two-dimensional Bartlett window to be the smoothing function, with M = 15.

Since, under $H_0$, $G^{(b)}$ has the same asymptotic distribution as $G^{(w)}$, we get $z_{.1} = .0559$, $z_{.05} = .0615$, $z_{.01} = .0743$. Figure 5 shows $I^x(\lambda_j)/Rf_N^x(\lambda_j,0)$ vs. frequency. The results are summarized in Table 5, which shows that the $G^{(b)}$ test detects the harmonic processes at the right frequency in both cases.

(v) $G^{(t)}$ test

Since the linear process $\{Z_t\}$ is generated from an exponential distribution, its fourth order cumulant certainly exists. We choose the 3-dimensional Bartlett window with lag M = 10. Figure 6 shows the plot of $I^x(\lambda_j)/Rf_N^x(\lambda_j,0,0)$ vs. frequency. Table 6 presents the $G^{(t)}$ values for the suspected peaks. The results show that the $G^{(t)}$ test also detects the harmonic component at the right frequency in both cases.


[Figures 1 to 6: plots not reproduced. Figure 1 is a time series plot.]


Table 1  Whittle's test

SERIES #   sum_j I(λ_j)/|f_N(λ_j)|   λ        I(λ)/|f_N(λ)|   G^(w)
1          118.1291                  1.5708   7.1551          0.0606 **
2          115.1999                  2.6507   5.2138          0.0453

Table 2  Significance levels for the grouped periodogram test

α level    k = 4     k = 8     k = 12
0.1        .9077     .6398     .4778
0.05       .9267     .6737     .5097
0.01       .9571     .7407     .5764

Table 3  Grouped periodogram test

SERIES #   k     G^(g)
1          4     0.9631 ***
1          8     0.8019 ***
1          12    0.8431 ***
2          4     0.7710
2          8     0.5382
2          12    0.3344

Table 4  Priestley P(λ) test

SERIES #   θ        P*(θ)    J_q(θ)
1          1.5708   0.1246   0.2317
2          2.6507   0.6364   2.0162 **

Table 5  G^(b) test

SERIES #   sum_j I(λ_j)/|f_N(λ_j,0)|   λ        I(λ)/|f_N(λ,0)|   G^(b)
1          853.3934                    1.5708   123.2088          .1444 ***
2          708.4138                    2.6507   63.4764           .0896 ***

Table 6  G^(t) test

SERIES #   sum_j I(λ_j)/|f_N(λ_j,0,0)|   λ        I(λ)/|f_N(λ,0,0)|   G^(t)
1          6193.372                      1.5708   1458.640            .2355
2          4414.245                      2.6507   498.3887            .1129 ***
TIME SERIES IN A MICROCOMPUTER ENVIRONMENT

John D. Henstridge
Perth, Western Australia

ABSTRACT

The task of transferring a moderately large statistical package onto a microcomputer is described. It is shown that substantial consideration has to be given to the limitations of the microcomputer architecture, but once this has been done the package can be made surprisingly efficient. The user interface also requires major adaptation so that it is more compatible with non-statistical microcomputer software. This leads to the necessity for interactive graphics and screen based operations.

Keywords: microcomputers, statistical computing, time series.

Introduction

TSA was first released in 1982 as a specialist time series package. It was designed with mainframe computers in mind and its strength lay in its ability to manipulate time series data. It was equally at home in both the time and frequency domains and used as its basic data types series, filters and Fourier transforms.

Early in 1987 it was decided to review the future of TSA both in terms of its statistical facilities and its implementation on various machines. A market analysis and feedback from users indicated that TSA needed strengthening in time domain model fitting (and forecasting) and that a personal computer (PC) version was needed to complement the mainframe versions. After a feasibility study it was decided to update TSA and have the PC version as the primary version. It was thought that the constraints of the PC were the greatest that were likely to be encountered and hence a PC version could be ported to a mainframe more readily than the reverse.

This paper describes what this involves and can be considered as a case study for the problem of transferring statistical software to the PC.

The Existing form of TSA

TSA Release 1 (Henstridge, 1980, 1982) was a program consisting of approximately 14000 lines of highly portable Fortran 66. It had its own command language similar to some structured dialects of BASIC, specially adapted to time series data. The series data type was different from vectors in many other packages in that a series had both a length and a starting time. Operations which would not make sense with vectors (such as adding together two vectors of different length) were properly defined for series in TSA. In addition Fourier transforms could be manipulated, with the TRANSFORM command converting between series and transforms.
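The idea of a series that carries a starting time as well as a length can be illustrated with a small Python sketch. This is not TSA code, and the paper does not say how TSA defines the sum of two series of different length; the rule used here (add over the time points the two series share) is an assumption, as are the names `Series` and `lag`.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Series:
    start: int            # time index of the first observation
    values: List[float]

    @property
    def end(self) -> int:
        """Time index one past the last observation."""
        return self.start + len(self.values)

    def __add__(self, other: "Series") -> "Series":
        # Assumed rule: align by time and add over the common time points.
        lo, hi = max(self.start, other.start), min(self.end, other.end)
        if hi <= lo:
            return Series(lo, [])
        return Series(lo, [self.values[t - self.start] + other.values[t - other.start]
                           for t in range(lo, hi)])

    def lag(self, k: int) -> "Series":
        """Shift the series k time points later without changing its values."""
        return Series(self.start + k, list(self.values))
```

For example, `Series(0, [1, 2, 3]) + Series(1, [10, 20, 30])` gives `Series(1, [12, 23])` under this rule, because only times 1 and 2 are common to both.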

Some of the flavour of TSA can be gained from the following examples of TSA input which demonstrate two different methods of estimating the spectrum of a series X. The first approach is to fit a parametric model and then display the theoretical gain function of this:

QAIC 20 X %X

Fit an autoregressive model to X, using the Akaike criterion, and store this as the filter %X.

GAIN SQUARED %X

Display the square of the gain function of this filter. The second approach is to form the Fourier transform and thus obtain the smoothed periodogram:

TRANSFORM X

Form the Fourier transform and name it *X.

SPECTRUM *X

Display the spectrum by smoothing the periodogram formed from *X.

The flexibility of TSA derived from the fact that the data objects %X and *X in the examples could be manipulated in themselves. For example the transform *X could have arithmetic operations performed on it by the CALCULATE command, or the filter %X could be used in the FILTER command to define new filters or to filter series.

The time domain model fitting facilities in TSA Release 1 emphasised univariate Box-Jenkins or ARIMA type models. The PRELIMINARY command would obtain preliminary estimates and the ARIMA command would then obtain least squares estimates. Separate commands such as QAIC above allowed for the quick fitting of autoregressive models, using automatic order selection if desired. The implementation of these commands used the NAG Library extensively. Its flexibility came from the way that models were stored as filters and could be operated on by all the filter commands in TSA.

Time Series Model Fitting


Time  series  modeling  has  a  number  of  unique 
features  which  in  combination  distinguish  it  from 
other  statistical  model  selection  and  fitting  problems. 
These  include 

(i)  The  parameters  are  constrained  to  lie  within  a 
not  readily  described  region. 

(ii) The estimation procedure itself is non-linear
and frequently has problems of local
multicollinearity.

(iii) The  structured  index  set  (time  itself)  means 
that  it  is  rarely  possible  to  exclude  parts  of  the 
data  or  to  model  subsets  of  the  data 
separately. 

(iv)  There  are  many  different  possible  diagnostic 
methods  which  could  be  used  in  model 
selection. 

Methods  are  available  which  can  automatically 
select  models  (Akaike,  1969  and  many  subsequent 
papers).  These  methods  tend  to  be  computationally 
very  demanding  and  the  theoretical  results  are 
almost  exclusively  asymptotic,  with  the  small 
sample  situation  not  being  well  understood  except  in 
the case of simple autoregressions. Consequently it
is  more  common  for  a  statistician  to  select  a  model 
by  examining  summary  statistics  such  as  the 
autocorrelation  function  followed  by  fitting  several 
models  and  obtaining  diagnostic  statistics  for  these. 

There  are  a  variety  of  summary  and  diagnostic 
statistics  in  common  use.  The  most  commonly  used 
are  the  autocorrelation  function  and  the  partial 
autocorrelation  function,  but  use  can  also  be  made  of 
spectra  and  various  residual  plots. 


Needs  of  a  Personal  Computer  Implementation 

It  is  commonly  thought  that  a  program  has  to  be 
cut down to fit onto a PC, but it is easy to forget
that  the  memory  available  on  a  PC  is  comparable  to 
that  available  to  the  average  interactive  user  on 
many  mainframes  only  six  or  seven  years  ago.  The 
PC  is  small  compared  with  other  machines  today  but 
any  interactive  statistical  program  which  ran  on  a 
mainframe  several  years  ago  should  be  able  to  fit 
onto  a  PC  today. 

Instead the main constraints on the PC are:


(i)  The  operating  system  is  relatively  crude  and 
is  supported  by  few  good  utilities.  It  cannot 
even  be  assumed  that  a  user  has  access  to  a 
screen  editor  let  alone  a  good  graphics 
program.  Hence  a  statistical  program  has  to 
supply  many  features  provided  by  the 
operating  system  on  other  computers.  As  a 
secondary  issue  for  the  software  developer, 
until  recently  the  Fortran  compilers  for  the 
PC have been of doubtful quality. Even now,
PC  Fortran  compilers  are  not  totally  reliable. 

(ii)  The  disk  drives  are  relatively  slow  compared 
with  mainframes,  due  partly  to  the  operating 
system  using  only  primitive  algorithms  to 
optimise  the  use  of  the  disks.  Today  the  speed 
of  the  disks  tends  to  be  a  greater  constraint 
than  the  CPU  speed.  This  imposes  serious 
constraints  on  overlaying  a  program  - 
basically  all  overlay  changes  within  a  loop  or 
iteration  must  be  avoided. 

(iii) The  typical  PC  user  has  come  to  expect 
software  which  makes  use  of  the  wide 
communications  bandwidth  between  the  CPU 
and  the  screen.  Windows,  screen  graphics, 
prompts  and  on-screen  editing  are  all  part  of 
this.  This  is  a  good  influence  in  the  long  term 
but  it  does  create  extra  work  for  the  software 
developer. 

(iv)  The  typical  PC  user  is  not  paper  oriented. 
This  is  partly  due  to  many  users  never 
knowing  a  batch  environment  where  printed 
output  was  the  only  feedback  from  the 
computer  but  it  is  also  related  to  the  screen 
presentation  being  more  dynamic  than  a 
printed  page  can  ever  be.  It  does  mean  that  a 
statistical  package  must  enable  the  user  to 
display  virtually  any  information  at  any  time 
rather  than  assuming  that  the  user  has  a 
printed  copy  of  past  output. 

Implementation 

The  initial  feasibility  study  for  the  PC  version 
involved  a  direct  port  of  the  32-bit  mainframe 
version  onto  a  PC.  This  was  not  a  particularly 
difficult  task  since  TSA  had  originally  been  written 
with  highly  centralised  input  and  output  modules 
and  portability  was  a  major  design  consideration. 
The  initial  PC  version  had  about  360  Kbytes  of  code 
but  was  overlayed  to  run  in  less  than  200  Kbytes.  It 
made  no  use  of  special  PC  features  and  it  has  to  be 
said that it ran slowly. The code size was surprising
since it was somewhat larger than the code size on
other computers (especially the VAX and a 68000
based Unix computer). The compiler being used
(Microsoft) was generally thought to produce
efficient code so the causes for this were examined
before proceeding further.

Study of the code produced by the compiler made it
clear that it was no use pretending that the PC is a 32
bit machine. The 8086 microprocessor has 16 bit
registers and although it can address up to 1 Mbyte
of memory, it is only efficient when the code and
data are locally restricted to 64 Kbyte segments. In
addition the microprocessor has relatively few
registers and most have various constraints on their
use. These produced substantial overheads in
accessing large arrays and in parameter passing in
subroutine calls. It became clear that major savings
in code size were attainable by using 16 bit integers,
modifying the addressing of COMMON blocks,
reducing subroutine parameter lists and transferring
the text of error messages to a separate file. The
result  was  a  40%  reduction  in  code  size,  and  the 
program  running  several  times  faster.  These  changes 
in  themselves  required  remarkably  little  change  to 
the  source  code  of  TSA. 


Program Enhancements

The  savings  in  code  size  permitted  a  large  number  of 
enhancements  to  be  made  to  the  original  version  of 
TSA.  In  addition  to  a  number  of  statistical  additions 
detailed  below  these  included  extended  graphics 
with  output  to  the  screen,  printer  or  plotter,  fuller 
control  of  input  and  output  and  an  on-line  help 
system. 

The  language  of  the  package  itself  was  extended  in  a 
number  of  ways  to  emphasise  the  time  series 
application.  Time  leads  and  lags  can  now  be  applied 
to  series  in  any  situation  using  a  postfix  notation,  and 
commands  have  been  extended  so  that  most 
common  operations  now  require  fewer  commands. 
As an example of what this allows, the instruction

GRAPH X ON X(-1)

plots  the  values  of  the  series  X  against  the  values  for 
the  previous  time  point.  Many  of  these  language 
changes  did  not  involve  increases  in  code  size  at  all; 
instead  they  involved  moving  existing  code  from 
individual  commands  into  the  parser  where  they 
could  be  used  by  all  commands. 

The new graphics module extends the earlier concept of a graphical layout, which is a TSA data structure giving all the parameters for a graphical display. All commands have a default layout, but this can be modified by the user and if required stored as a user defined layout. The guiding principle here has been to make appropriately labelled and scaled graphical displays readily accessible while not reducing the flexibility available to the user who needs it.

In  addition  a  screen  interface  was  developed.  This 
uses a status line to divide the screen into an input
window  and  an  output  window.  Previous  lines  of 
input  can  be  viewed  and  edited,  while  previous 
output  can  be  scrolled  back  for  viewing.  The  design 
of  this  screen  interface  was  dictated  by  statistical 
considerations  outlined  below. 

The final version consists of about 30000 lines of
code,  of  which  97%  is  portable  Fortran  77,  2%  is 
Fortran  specific  to  the  PC  and  1%  is  assembly 
language.  On  the  PC  it  compiles  into  about  360 
Kbytes  and  runs  in  about  370Kbytes  of  RAM.  The 
program  is  overlayed  only  to  a  limited  extent; 
overlay  changes  are  sufficiently  infrequent  that  it  is 
feasible  to  run  on  a  PC  without  a  hard  disk. 


Special  Facilities  for  Model  Selection  and  Fitting 

Most  of  the  statistical  additions  to  TSA  were 
designed  to  aid  in  the  selection  and  fitting  of  time 
domain  models.  They  include  transfer  function 
modelling  with  options  of  least  squares,  maximum 
likelihood  and  marginal  likelihood  criteria  (again 
using  the  NAG  Library  routines),  the  display  of 
impulse  response  functions  and  a  DECOMPOSE 
command  which  can  use  a  variety  of  methods  to 
decompose  series  into  trend,  seasonal  and  irregular 
components. These affect model fitting as follows:

(i)  Selection  is  aided  by  the  new  DECOMPOSE 
command  which  can  separate  the  seasonal, 
trend  and  irregular  components.  This  is  seen 
primarily  as  an  aid  to  identification  of 
models. 

(ii)  Details  of  all  models  fitted  to  a  series  are 
automatically  stored.  These  details  include 
the structure of the model, the parameter
estimates,  goodness  of  fit  measure  and  the 
method  of  fitting  (preliminary,  least  squares, 
maximum  likelihood  or  marginal  likelihood). 
It  is  possible  to  display  these  at  any  stage.  The 
fact  that  the  display  of  more  than  one  model 
might  not  fit  on  the  screen  at  one  time  was  the 
major  reason  for  the  scrolling  facility  on  the 
output  window  (since  as  outlined  above,  the 
user  has  probably  not  got  a  printed  record). 

(iii) A  brief  comparative  display  of  all  models 
fitted  is  also  available.  Together  these  two 
features  provide  the  user  with  a  means  of 
managing a number of competing models.

(iv) Residual and fitted value series are
automatically  extracted  for  any  model 
properly  fitted.  In  addition  the  individual 
components  of  a  model  can  be  extracted  as 
filters.  These  series  and  filters  are  then 
immediately  available  for  diagnostic  analysis 
by  a  number  of  TSA  commands  as  shown  in 
Figure  1. 

(v)  In  the  selection  process  it  is  common  to  fit 
models  which  are  modifications  on 
previously  fitted  models.  Rather  than 
implement  a  special  command  to  edit  an 
existing  model,  the  input  window  editor  was 
developed  so  that  the  input  line  where  the 
previous  model  was  specified  could  be  edited 
and  then  entered  as  new  input.  This  was 
considered to be visually easier and it gives a
facility  which  can  be  used  in  other  situations 
as  well.  In  addition  the  filters  which  are 
extracted  from  previous  models  can  be  used 
to  define  components  of  a  new  model. 

Summary 

The final result of this work was a new version of TSA which had greatly improved functionality while still fitting on a standard PC. The most striking result was that the original portability of the program could be maintained while giving a true PC implementation which makes use of the unique features of the PC. Experience has shown that a conversion to a different machine is less than a day's work and it has been possible to automate most of this conversion process.

Most  of  the  lessons  learnt  from  this  would  apply  to 
other  statistical  packages  which  are  command 
oriented. 


Acknowledgement

The  author  wishes  to  thank  NAG  Ltd.,  Oxford  for 
the  use  of  its  facilities  and  access  to  the  NAG 
Library  for  this  work. 


References 

AKAIKE, H. (1969), Fitting autoregressions for predictions. Annals of the Institute of Statistical Mathematics, Tokyo, 21, 243-247.

HENSTRIDGE, J. D. (1980), TSA - a package for time series analysis, COMPSTAT 80, ed. M.M. Barritt and D. Wishart, Physica Verlag, Wien.

HENSTRIDGE, J. D. (1982), TSA - an interactive package for time series analysis. NAG, Oxford.


MOVING  WINDOW  DETECTION  FOR  0-1  MARKOV  TRIALS 


Joseph  Glaz,  Philip  C.  Hormel  and  Bruce  McK.  Johnson* 
University  of  Connecticut  and  CIBA-GEIGY  Corporation 


Abstract. Let $X_1, \ldots, X_n, \ldots$ be a sequence of 0-1 Markov trials. The random variable $X_i$ represents the number of signals that were detected at the end of the ith discrete-time interval. The k-out-of-m moving window detector generates a pulse whenever k or more signals are detected within m consecutive discrete-time intervals. Define $M_{k,m}$ to be the waiting time for detection using a k-out-of-m moving window detector. In this article we derive Bonferroni-type inequalities and product-type approximations for the distribution of $M_{k,m}$, which in turn yield approximations for $E(M_{k,m})$ and $\mathrm{Var}(M_{k,m})$. These quantities play an important role in the design and analysis of the k-out-of-m moving window detection procedure. Applications to the theory of radar detection and quality control (zone tests) are discussed.

1. Introduction. Let $X_1, \ldots, X_n, \ldots$ be a
sequence of 0-1 valued random variables. The
random variable $X_i$ represents the number of
signals that were detected at the end of the ith
discrete-time interval. The k-out-of-m moving
window detector generates a pulse whenever k or
more signals are detected within m consecutive
discrete-time intervals. Define $M_{k,m}$ to be the
waiting time for detection using a k-out-of-m
moving window detector. Then $E(M_{k,m})$ and
$\mathrm{Var}(M_{k,m})$ are the mean and the variance of
recurrence time, respectively, for at least k
events in a moving window of size m. The waiting
time, the mean and the variance of recurrence
for detection using a k-out-of-m moving
window detection procedure play an important
role in a variety of applications. We describe
below applications to quality control and radar
detection.
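
As an illustration of the definition above, the following minimal sketch computes the waiting time for a given 0-1 sequence. The function name `waiting_time` and its return convention (`None` when no pulse is generated within the observed sequence) are choices made here for illustration and are not part of the paper.

```python
from typing import Optional, Sequence


def waiting_time(x: Sequence[int], k: int, m: int) -> Optional[int]:
    """Return the first n (1-based) at which the last m trials hold >= k ones.

    Hypothetical helper: returns None if no detection occurs in x.
    """
    window_sum = 0
    for n, value in enumerate(x, start=1):
        window_sum += value
        if n > m:
            # Drop the trial that has just left the window of size m.
            window_sum -= x[n - m - 1]
        if window_sum >= k:
            return n
    return None


if __name__ == "__main__":
    # 2-out-of-3 detector applied to a short sequence.
    print(waiting_time([0, 1, 0, 1, 1], k=2, m=3))
```

For the sequence 0, 1, 0, 1, 1 with k = 2 and m = 3, the window first holds two 1's when it covers trials 2 to 4, so the waiting time is 4.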

In quality control, the 1's in the sequence
correspond to defective items. Greenberg [9],
Roberts [20], and Saperstein [23] study the
properties of the zone tests. They define the
process to be "out of control" if a moving
window of size m contains at least k observations
outside a specified zone (say, the three
sigma limits about the mean). The random variable
$M_{k,m}$ is the waiting time between times at
which the process is declared to be "out of
control" and $1/E(M_{k,m})$ is the probability of
type I error for testing the hypothesis that
the process is "in control." For k = m the zone
test is based on the run statistic, a special
case of the moving window statistic, which
was introduced by Mosteller [15].

Consider a radar sweep where a dichotomous
quantizer transmits to the detector the digit 1
if the signal-plus-noise waveform exceeds a
predetermined threshold, and the digit 0 otherwise.
Thus the data from a radar sweep is
transformed into a random sequence of 0 and 1.
The k-out-of-m moving window detector generates
a pulse whenever k or more 1's were observed
within a consecutive string of m elements. The
moving window detection procedure has been
discussed extensively in the literature ([1], [3],
[5], [6], [14], [18], [25]). Dinneen and Reed
[3] discuss signal detection and location by
various digital methods. They conclude that
"the moving window detector satisfies the detection
and beam splitting criteria and at the same
time is logically the simplest." Moreover, one
can obtain a good estimate of the target center
by employing the center of the window where at
least k signals were observed. Bogush [1],
Galati and Studer [5], Lefferts [14], Nelson [18]
and Todd [25] study the moving window detection
procedure under the assumption that the observed
sequence of 0 and 1 is generated by a simple
Markov process. The random variable $M_{k,m}$ is the
waiting time between the times that the k-out-of-m
moving window detector generates a pulse. The
quantities $P(M_{k,m} > n)$, $E(M_{k,m})$ and $\mathrm{Var}(M_{k,m})$
play an important role in the design of the moving
window detector.

The evaluation of the quantities $P(M_{k,m} > n)$,
$E(M_{k,m})$ and $\mathrm{Var}(M_{k,m})$ is a formidable task. Even
in the simpler situation, when the sequence of 0
and 1 are i.i.d. Bernoulli trials, these quantities
can be evaluated only for limited values of
k and m ([9], [12], [16] and [23]). Moreover,
the methods developed for evaluating the quantities
mentioned above in the i.i.d. Bernoulli case
cannot be extended for the Markov model. Naus
[17] and Samuel-Cahn [22] developed accurate
approximations for the i.i.d. Bernoulli case,
which also cannot be easily extended to the Markov
model. For the Markov model, Glaz [6, Section
III] derived a product-type lower bound for
$P(M_{k,m} > n)$ and $E(M_{k,m})$. Since in many instances
this lower bound is quite conservative (see
Tables I-II in Section 3), there is a need for
more accurate approximations.

In Section 2 we derive a product-type approximation
for $P(M_{k,m} > n)$ that is far more accurate
than the lower bound that was derived in [6].
This in turn yields accurate approximations for
$E(M_{k,m})$ and $\mathrm{Var}(M_{k,m})$. Recently, Hoover [11]
derived Bonferroni-type upper (lower) bounds for
a union (intersection) of a given sequence of
events. For the problem at hand we evaluate in
Section 2 the Bonferroni-type lower bounds for
$P(M_{k,m} > n)$.

In Section 3, Tables I-II, we present the
results of a simulation study for evaluating
$P(M_{k,m} > n)$, $E(M_{k,m})$ and $\mathrm{Var}(M_{k,m})$. These
results are used to compare the product-type and
Bonferroni-type approximations in Tables I-II. A
discussion of the approximations is presented at
the end of Section 3.

2. Product-type and Bonferroni-type approximations.
Suppose that the observations $X_1, X_2, \ldots$
form a 0-1 sequence of Markov trials. Assume
that

$$p = P(X_1 = 1)$$

and

$$p_i = P(X_j = 1 \mid X_{j-1} = i), \quad i = 0,1;\ j = 2,3,\ldots.$$

We assume that the correlation coefficient of two
successive observations is positive and
$p = P(X_j = 1)$, $j \ge 1$. Therefore $X_1, X_2, \ldots$ is a
two-state stationary homogeneous Markov chain
and all the conditional probabilities can be
derived from $p$ and $p_i$, $i = 0,1$. Moreover, it
follows that $p_0 < p < p_1$. The waiting time for
the detection using a k-out-of-m window detector
satisfies

$$M_{k,m} = \inf\Big\{n \ge 1 :\ \sum_{i=\max(1,n-m+1)}^{n} X_i \ge k\Big\}, \tag{2.1}$$

which we will abbreviate to $M$. For $n \ge m \ge 2$
we define

$$\gamma_{n-m+1,n} = P\Big(\bigcap_{j=1}^{n-m+1}\Big\{\sum_{i=j}^{m+j-1} X_i < k\Big\}\Big), \tag{2.2}$$

$$S(j,n) = \sum_{i=0}^{\min(j,n-j)} \binom{j}{i}\binom{n-i}{j}\, p_1^{\,j-i}\,(1-p_0)^{\,n-j-i}\,(p_0-p_1)^{i}, \tag{2.3}$$

and

$$S_t(n) = \sum_{j=0}^{t} S(j,n), \tag{2.4}$$

in terms of which we have the following results.

Theorem 2.1. Let $M$ be the waiting time for
detection, then for $k \le n \le m+2$

$$P(M > n) = \gamma_{\max(1,n-m+1),n},$$

where

$$\gamma_{1,n} = S_{k-1}(n-1) + (p_0-p_1)\,S_{k-2}(n-2) - p\,S(k-1,n-1), \quad k \le n \le m, \tag{2.5}$$

$$\gamma_{2,m+1} = \gamma_{1,m} - (1-p)\sum_{i=\max(0,2k-m-1)}^{k-1} \binom{k-1}{k-i-1}\binom{m-k}{k-i-1}\, p_0^{\,k-i}\, p_1^{\,i}\,(1-p_0)^{\,m+1-2k+i}\,(1-p_1)^{\,k-i-1}, \tag{2.6}$$

and

$$\gamma_{3,m+2} = \gamma_{2,m+1} - (1-p)\sum_{i=\max(0,2k-m-2)}^{k-2} \Big[\binom{k-2}{k-i-2}\binom{m-k}{k-i-2}\, p_0^{\,k-i-1}\, p_1^{\,i}\,(1-p_0)^{\,m+2-2k+i}\,(1-p_1)^{\,k-i-2}\Big] \cdot \Big[p_0(1-p_1)\,\frac{m-2k+i+2}{k-i-1} + p_1\Big]. \tag{2.7}$$

Proof. To derive equation (2.5), we use the
following results from Helgert [10]:

$$P\Big(\sum_{i=1}^{n} X_i = j \,\Big|\, X_1 = 1\Big) = S(j-1,n-1) + (p_0-p_1)\,S(j-1,n-2)$$

and

$$P\Big(\sum_{i=1}^{n} X_i = j \,\Big|\, X_1 = 0\Big) = S(j,n-1) + (p_0-p_1)\,S(j-1,n-2).$$

Therefore,

$$\gamma_{1,n} = P\Big(\sum_{i=1}^{n} X_i < k\Big) = \sum_{j=0}^{k-1}\Big\{ p\,[S(j-1,n-1) + (p_0-p_1)S(j-1,n-2)] + (1-p)\,[S(j,n-1) + (p_0-p_1)S(j-1,n-2)] \Big\}.$$

Simplifying the expression given above results in
equation (2.5).

To evaluate $\gamma_{2,m+1}$, note that

$$\gamma_{2,m+1} = \gamma_{1,m} - P(A),$$

where $A = (X_1 = 0,\ \sum_{i=2}^{m} X_i = k-1,\ X_{m+1} = 1)$. The event $A$
involves m transitions from the initial state 0
to the final state 1. Denote by $j_1 \to j_2$ a
transition from state $j_1$ to state $j_2$, $j_1, j_2 = 0,1$.
Let $i$ be the number of $1 \to 1$ transitions.
Then, $\max(0,2k-m-1) \le i \le k-1$. Since we have a
total of k visits to state 1, we must have $k-i$
$0 \to 1$ transitions. As the initial state is 0 and
the final state is 1, the number of $1 \to 0$ transitions
is equal to $k-i-1$ (the number of
$0 \to 1$ transitions minus one). Therefore, the
remaining $m+1+i-2k$ transitions are of the
type $0 \to 0$. Since there are $\binom{k-1}{k-i-1}$ ways of
arranging the $k-i-1$ $0 \to 1$ transitions between
the first and the last 1 and $\binom{m-k}{k-i-1}$ ways
of arranging the $k-i-1$ $1 \to 0$ transitions between
the first and the last 0, we get that

$$P(A) = (1-p)\sum_{i=\max(0,2k-m-1)}^{k-1} \binom{k-1}{k-i-1}\binom{m-k}{k-i-1}\, p_0^{\,k-i}\, p_1^{\,i}\,(1-p_0)^{\,m+1+i-2k}\,(1-p_1)^{\,k-i-1}.$$

Equation (2.6) follows.

To evaluate $\gamma_{3,m+2}$, define the events

$$A_1 = \Big(X_1 = 0,\ X_2 = 0,\ \sum_{i=3}^{m} X_i = k-1,\ X_{m+1} = 0,\ X_{m+2} = 1\Big),$$

$$A_2 = \Big(X_1 = 0,\ X_2 = 0,\ \sum_{i=3}^{m} X_i = k-2,\ X_{m+1} = 1,\ X_{m+2} = 1\Big),$$

and

$$A_3 = \Big(X_1 = 1,\ X_2 = 0,\ \sum_{i=3}^{m} X_i = k-2,\ X_{m+1} = 1,\ X_{m+2} = 1\Big).$$

Then it follows that

$$\gamma_{3,m+2} = \gamma_{2,m+1} - \sum_{j=1}^{3} P(A_j).$$

Using a similar enumeration technique for evaluating
$P(A)$ above, one obtains:

$$P(A_1) = (1-p)\sum_{i=\max(0,2k-m-2)}^{k-2} \binom{k-2}{k-i-2}\binom{m-k}{k-i-1}\, p_0^{\,k-i}\, p_1^{\,i}\,(1-p_0)^{\,m+2-2k+i}\,(1-p_1)^{\,k-i-1},$$

$$P(A_2) = (1-p)\sum_{i=\max(0,2k-m-2)}^{k-2} \binom{k-2}{k-i-2}\binom{m-k}{k-i-2}\, p_0^{\,k-i-1}\, p_1^{\,i+1}\,(1-p_0)^{\,m+3-2k+i}\,(1-p_1)^{\,k-i-2},$$

and

$$P(A_3) = p\sum_{i=\max(0,2k-m-2)}^{k-2} \binom{k-2}{k-i-2}\binom{m-k}{k-i-2}\, p_0^{\,k-i-1}\, p_1^{\,i+1}\,(1-p_0)^{\,m+2-2k+i}\,(1-p_1)^{\,k-i-1}.$$

Simplifying the expressions for $\sum_{j=1}^{3} P(A_j)$ yields
equation (2.7). This concludes the proof of
Theorem 2.1.
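
The formulas of Theorem 2.1 can be checked numerically. The sketch below is an illustration rather than the authors' program: it implements equations (2.3) to (2.7) as read here and compares them with a brute-force enumeration of all 0-1 sequences. The function names are hypothetical, $p_0$ is obtained from the stationarity relation $p = p\,p_1 + (1-p)\,p_0$, and $2 \le k \le m$ is assumed.

```python
from itertools import product
from math import comb


def p0_of(p: float, p1: float) -> float:
    """p0 implied by stationarity: p = p*p1 + (1 - p)*p0."""
    return p * (1.0 - p1) / (1.0 - p)


def S(j: int, n: int, p0: float, p1: float) -> float:
    """Equation (2.3); taken as 0 outside 0 <= j <= n."""
    if n < 0 or j < 0 or j > n:
        return 0.0
    return sum(
        comb(j, i) * comb(n - i, j)
        * p1 ** (j - i) * (1.0 - p0) ** (n - j - i) * (p0 - p1) ** i
        for i in range(min(j, n - j) + 1)
    )


def S_cum(t: int, n: int, p0: float, p1: float) -> float:
    """Equation (2.4): sum of S(j, n) over j = 0..t (empty sum for t < 0)."""
    return sum(S(j, n, p0, p1) for j in range(t + 1))


def gamma_1(k: int, n: int, p: float, p1: float) -> float:
    """Equation (2.5): P(X_1 + ... + X_n < k)."""
    p0 = p0_of(p, p1)
    return (S_cum(k - 1, n - 1, p0, p1)
            + (p0 - p1) * S_cum(k - 2, n - 2, p0, p1)
            - p * S(k - 1, n - 1, p0, p1))


def gamma_2(k: int, m: int, p: float, p1: float) -> float:
    """Equation (2.6): P(M > m + 1)."""
    p0 = p0_of(p, p1)
    total = sum(
        comb(k - 1, k - i - 1) * comb(m - k, k - i - 1)
        * p0 ** (k - i) * p1 ** i
        * (1 - p0) ** (m + 1 - 2 * k + i) * (1 - p1) ** (k - i - 1)
        for i in range(max(0, 2 * k - m - 1), k)
    )
    return gamma_1(k, m, p, p1) - (1 - p) * total


def gamma_3(k: int, m: int, p: float, p1: float) -> float:
    """Equation (2.7): P(M > m + 2)."""
    p0 = p0_of(p, p1)
    total = sum(
        comb(k - 2, k - i - 2) * comb(m - k, k - i - 2)
        * p0 ** (k - i - 1) * p1 ** i
        * (1 - p0) ** (m + 2 - 2 * k + i) * (1 - p1) ** (k - i - 2)
        * (p0 * (1 - p1) * (m - 2 * k + i + 2) / (k - i - 1) + p1)
        for i in range(max(0, 2 * k - m - 2), k - 1)
    )
    return gamma_2(k, m, p, p1) - (1 - p) * total


def survival_brute_force(n: int, k: int, m: int, p: float, p1: float) -> float:
    """P(M > n) by enumerating all 0-1 sequences of length n (small n only)."""
    p0 = p0_of(p, p1)
    total = 0.0
    for x in product((0, 1), repeat=n):
        if any(sum(x[max(0, t - m):t]) >= k for t in range(1, n + 1)):
            continue  # a pulse was generated by time n
        pr = p if x[0] else 1.0 - p
        for prev, cur in zip(x, x[1:]):
            q = p1 if prev else p0
            pr *= q if cur else 1.0 - q
        total += pr
    return total


if __name__ == "__main__":
    k, m, p, p1 = 3, 6, 0.1, 0.3
    print(gamma_1(k, m, p, p1), survival_brute_force(m, k, m, p, p1))
    print(gamma_2(k, m, p, p1), survival_brute_force(m + 1, k, m, p, p1))
    print(gamma_3(k, m, p, p1), survival_brute_force(m + 2, k, m, p, p1))
```

The enumeration grows as $2^n$ and is meant only as a check for small m; it is not a practical way to evaluate $P(M > n)$.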


We now proceed to derive an approximation for
$P(M > n)$. Let $E_j$ denote the event $\sum_{i=j}^{j+m-1} X_i < k$,
$j = 1,2,\ldots$. Then for $1 \le L \le n-m$

$$P(M > n) = P\Big(\bigcap_{j=1}^{n-m+1} E_j\Big) = P\Big(\bigcap_{j=1}^{L} E_j\Big)\prod_{j=L+1}^{n-m+1} P\Big(E_j \,\Big|\, \bigcap_{i=1}^{j-1} E_i\Big) = \gamma_{L,m+L-1}\prod_{j=L+1}^{n-m+1}\frac{\gamma_{j,m+j-1}}{\gamma_{j-1,m+j-2}}. \tag{2.8}$$

We propose to employ the following (L-1)th order
"Markov-like" approximation for $P(E_j \mid \bigcap_{i=1}^{j-1} E_i)$:

$$P\Big(E_j \,\Big|\, \bigcap_{i=1}^{j-1} E_i\Big) \approx P\Big(E_j \,\Big|\, \bigcap_{i=j-L+1}^{j-1} E_i\Big), \quad j \ge L+1. \tag{2.9}$$

Substitute the right-hand side of equation (2.9)
into equation (2.8) and use the stationarity of
the events $E_j$, to get the desired product-type
approximation:

$$P(M > n) \approx \gamma_{L,m+L-1}\,(\gamma_L^*)^{\,n-m-L+1} = \pi_L, \tag{2.10}$$

where $\gamma_L^* = \gamma_{L,m+L-1}/\gamma_{L-1,m+L-2}$
and $\gamma_{i,j}$ is defined in equation (2.2). The
following result supports the approximation
(2.10).

Theorem 2.2. Let $M$ be the waiting time for
detection given by equation (2.1). Then there
exists a real number $0 < \gamma < 1$ such that

$$\lim_{n\to\infty} P(M > n+1 \mid M > n) = \gamma.$$

Proof. Let $S$ be the set of all possible
binary sequences of length m. Then the cardinality
of $S$ is $2^m$. Out of all binary sequences
with the property that their sum is greater or
equal to k we create a single absorbing state.
Let $S^*$ be the set containing the absorbing
state and the remaining elements of the set $S$.
Then $\{X_j + \cdots + X_{j+m-1}\}$ is a finite Markov
chain with a single absorbing state and one set
of communicating transient states. It now
follows from Darroch and Seneta [2, 54] that there
exists a constant $0 < \gamma < 1$ such that

$$\lim_{n\to\infty} P(M > n+1 \mid M > n) = \gamma.$$

This concludes the proof of Theorem 2.2.


It follows from Theorem 2.2 that the sequence
of conditional probabilities $\gamma_{j,m+j-1}/\gamma_{j-1,m+j-2}$
becomes stationary as j increases. Therefore, it
seems plausible to replace the terms
$\gamma_{j,m+j-1}/\gamma_{j-1,m+j-2}$, $j \ge L+1$, in equation
(2.8) with $\gamma_L^* = \gamma_{L,m+L-1}/\gamma_{L-1,m+L-2}$. This
substitution results in the approximation given by
equation (2.10). In Section 3, Table I, we present
numerical results for this approximation, in
the case of L = 3.
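
A minimal sketch of how the product-type approximation (2.10) could be evaluated is given below. It does not use the closed-form expressions of Theorem 2.1. Instead, as an assumption made here for brevity, the quantities $\gamma_{j,m+j-1} = P(M > m+j-1)$ are obtained by propagating the chain over the contents of the moving window and discarding paths on which a pulse has been generated. The function names are hypothetical, and $m \ge 2$ and $L \ge 2$ are assumed.

```python
from typing import Dict, List, Tuple


def survival_exact(n_max: int, k: int, m: int, p: float, p1: float) -> List[float]:
    """Return [P(M > 0), ..., P(M > n_max)] for the stationary 0-1 Markov chain.

    State = the last (at most) m - 1 trials. Paths whose current window of
    size m holds k or more ones are dropped, since detection has occurred.
    Assumes m >= 2.
    """
    p0 = p * (1.0 - p1) / (1.0 - p)  # stationarity: p = p*p1 + (1 - p)*p0
    surv = [1.0]
    states: Dict[Tuple[int, ...], float] = {(): 1.0}
    for _ in range(n_max):
        nxt: Dict[Tuple[int, ...], float] = {}
        for hist, pr in states.items():
            if not hist:
                q = p  # first trial
            else:
                q = p1 if hist[-1] == 1 else p0
            for bit, w in ((1, q), (0, 1.0 - q)):
                window = hist + (bit,)  # at most m trials
                if sum(window) >= k:
                    continue  # pulse generated: path leaves {M > n}
                key = window[-(m - 1):]
                nxt[key] = nxt.get(key, 0.0) + pr * w
        states = nxt
        surv.append(sum(states.values()))
    return surv


def product_type(n: int, L: int, k: int, m: int, p: float, p1: float) -> float:
    """pi_L of equation (2.10) for n >= m + L - 1 and L >= 2."""
    surv = survival_exact(m + L - 1, k, m, p, p1)
    g_L = surv[m + L - 1]        # gamma_{L, m+L-1}
    g_prev = surv[m + L - 2]     # gamma_{L-1, m+L-2}
    ratio = g_L / g_prev         # gamma*_L
    return g_L * ratio ** (n - m - L + 1)


if __name__ == "__main__":
    m, k, p, p1, n = 10, 3, 0.05, 0.10, 20
    print("pi_3       :", product_type(n, 3, k, m, p, p1))
    print("exact value:", survival_exact(n, k, m, p, p1)[n])
```

For k = 1 the approximation is exact, because survival then requires an unbroken run of 0's and the conditional probabilities in (2.8) are all equal to $1 - p_0$ after the first trial. This gives a convenient sanity check.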

We  now  turn  to  the  problem  of  deriving 
Bonferroni-type  lower  bounds  for  P(M  >  n). 
Recently  Hoover  (1987)  has  derived  a  sequence  of 
Bonferroni-type  upper  bounds  of  order  L, 

$1 \le L \le n-1$:


n  L 

P-  UA,-1  <  Pi'I^A.l 
i=l  ’  '  i=l  ’ 


n  L 

PfA.n  [  n  (U  A,  )‘^],  (2.11) 


i=L  +  l 


ij'S.  j  =  l 
c. . .  <i|^ <n 


where $A_1, \ldots, A_n$ is a sequence of events, c denotes
the complement of an event, and $S_i$ is a subset of
$\{1, \ldots, i-1\}$ of size L-1. For L = 1 and L = 2 the
right-hand side in (2.11) reduces to the usual
Bonferroni upper bound and Hunter Bonferroni-type
upper bound (Hoover, 1987). In the case that
$A_1, \ldots, A_n$ are naturally ordered in such a way
that $P(\bigcap_{j=1}^{m} A_{i_j})$ is maximized for $i_j - i_{j-1} = 1$,
$2 \le j \le m \le n-1$, the natural ordering with
$S_i = \{i-1, \ldots, i-L+1\}$ is recommended. In this case the
upper bound in equation (2.11) reduces to:

$$P\Big(\bigcup_{i=1}^{n} A_i\Big) \le \sum_{i=1}^{n} P(A_i) - \sum_{i=1}^{n-1} P(A_i \cap A_{i+1}) - \sum_{j=2}^{L-1}\sum_{i=1}^{n-j} P\Big(A_i \cap \Big(\bigcap_{t=1}^{j-1} A_{i+t}^{c}\Big) \cap A_{i+j}\Big).$$

If the events $A_1, \ldots, A_n$ are stationary, a
further simplification of (2.11) is obtained:

$$P\Big(\bigcup_{i=1}^{n} A_i\Big) \le n\,P(A_1) - (n-1)\,P(A_1 \cap A_2) - \sum_{j=2}^{L-1} (n-j)\,P\Big(A_1 \cap \Big(\bigcap_{i=1}^{j-1} A_{i+1}^{c}\Big) \cap A_{j+1}\Big). \tag{2.12}$$

The following result gives the Bonferroni-type
lower bounds for $P(M > n)$.

Theorem 2.3. Let $M$ be the waiting time for
detection given by equation (2.1). Then for
$L \ge 1$

$$P(M > n) \ge \beta_L = (n-m+2-L)\,\gamma_{L,m+L-1} - (n-m+1-L)\,\gamma_{L-1,m+L-2}, \tag{2.13}$$

where $\gamma_{0,m-1} = 1$, and for $L \ge 1$ $\gamma_{L,m+L-1}$ is
defined in equation (2.2). Moreover, for $L \ge 1$

... (2.14)

Proof. It follows from equations (2.8) and
(2.12) that for $L \ge 1$

$$P(M > n) \ge 1 - \Big\{(n-m+1)\,P(E_1^c) - (n-m)\,P(E_1^c \cap E_2^c) - \sum_{j=2}^{L-1} (n-m+1-j)\,P\Big(E_1^c \cap \Big(\bigcap_{i=1}^{j-1} E_{i+1}\Big) \cap E_{j+1}^c\Big)\Big\}, \tag{2.15}$$

where $E_j = (\sum_{i=j}^{j+m-1} X_i < k)$. It is routine to
verify that the right-hand side of equation
(2.15) simplifies to $\beta_L$, given by the right-hand
side of equation (2.13). Equation (2.14) follows
immediately from the definition of $\gamma_{i,j}$. This
completes the proof of Theorem 2.3.

In Section 3, Table I, we evaluate the performance
of the lower bound for L = 3.

We now proceed to derive approximations for
$E(M)$ and $\mathrm{Var}(M)$, the mean and the variance of
recurrence time, respectively, for at least k
events in a moving window of size m. It is
well-known, [4, 264-266], that

$$E(M) = \sum_{n=0}^{\infty} P(M > n)$$

and

$$\mathrm{Var}(M) = 2\sum_{n=1}^{\infty} n\,P(M > n) + E(M)[1 - E(M)].$$

Note that for $n \le k-1$, $P(M > n) = 1$. Therefore,

$$E(M) = (k-1) + \sum_{n=k-1}^{m+L-1} \gamma_{\max(1,n-m+1),n} + \sum_{n=m+L}^{\infty} \gamma_{n-m+1,n} \tag{2.16}$$

and

$$\mathrm{Var}(M) = (k-1)k + 2\sum_{n=k}^{m+L-1} n\,P(M > n) + 2\sum_{n=m+L}^{\infty} n\,\gamma_{n-m+1,n} + E(M)[1 - E(M)]. \tag{2.17}$$

Substitute in equations (2.16) and (2.17) for
$\gamma_{n-m+1,n}$ its product-type approximation
$\gamma_{L,m+L-1}(\gamma_L^*)^{\,n-m-L+1}$ and evaluate the respective
geometric series to get:

$$E(M) \approx (k-1) + \sum_{n=k-1}^{m+L-1} \gamma_{\max(1,n-m+1),n} + \frac{\gamma_{L,m+L-1}\,\gamma_L^*}{1-\gamma_L^*} = E_L \tag{2.18}$$

and

$$\mathrm{Var}(M) \approx k(k-1) + 2\sum_{n=k}^{m+L-1} n\,\gamma_{\max(1,n-m+1),n} + \frac{2\,\gamma_{L,L+m-1}\,\gamma_L^*\,[(m+L)(1-\gamma_L^*) + \gamma_L^*]}{(1-\gamma_L^*)^2} + E_L[1 - E_L] = V_L, \tag{2.19}$$

where $\gamma_{j,j+m-1}$ is defined in equation (2.2) and
$\gamma_L^* = \gamma_{L,m+L-1}/\gamma_{L-1,m+L-2}$. In Section 3, Table
II, we evaluate the approximations $E_L$ and
$SD_L = \sqrt{V_L}$, for L = 3.

Remark. The Bonferroni-type inequalities for
$P(M > n)$ are not suitable for evaluating $E(M)$
and $\mathrm{Var}(M)$ (for large $n_0$, $\beta_L < 0$ for $n \ge n_0$).
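
The bound (2.13) and the approximations for the mean and standard deviation can be sketched as follows. This is an illustration under stated assumptions and not the authors' code. The $\gamma$ values are again obtained by propagating the chain over the window contents (the hypothetical helper `survival_exact`). The mean is computed directly as the sum of $P(M > n)$ over $n \ge 0$, with the tail replaced by the geometric series in $\gamma_L^*$, which is the construction described for (2.18) and (2.19). Here $m \ge 2$ is assumed, and $L \ge 2$ for the moment approximations.

```python
from math import sqrt
from typing import Dict, List, Tuple


def survival_exact(n_max: int, k: int, m: int, p: float, p1: float) -> List[float]:
    """[P(M > 0), ..., P(M > n_max)] by propagating the window contents (m >= 2)."""
    p0 = p * (1.0 - p1) / (1.0 - p)  # from stationarity
    surv = [1.0]
    states: Dict[Tuple[int, ...], float] = {(): 1.0}
    for _ in range(n_max):
        nxt: Dict[Tuple[int, ...], float] = {}
        for hist, pr in states.items():
            q = p if not hist else (p1 if hist[-1] == 1 else p0)
            for bit, w in ((1, q), (0, 1.0 - q)):
                window = hist + (bit,)
                if sum(window) >= k:
                    continue  # detected: path leaves {M > n}
                key = window[-(m - 1):]
                nxt[key] = nxt.get(key, 0.0) + pr * w
        states = nxt
        surv.append(sum(states.values()))
    return surv


def bonferroni_lower(n: int, L: int, k: int, m: int, p: float, p1: float) -> float:
    """beta_L of equation (2.13), with gamma_{0,m-1} taken as 1."""
    surv = survival_exact(m + L - 1, k, m, p, p1)
    g_L = surv[m + L - 1]
    g_prev = 1.0 if L == 1 else surv[m + L - 2]
    return (n - m + 2 - L) * g_L - (n - m + 1 - L) * g_prev


def mean_sd_product_type(L: int, k: int, m: int, p: float, p1: float) -> Tuple[float, float]:
    """Approximate (E(M), SD(M)) using the product-type tail (L >= 2)."""
    surv = survival_exact(m + L - 1, k, m, p, p1)
    g = surv[m + L - 1]
    r = g / surv[m + L - 2]  # gamma*_L
    # Head: exact terms for n = 0..m+L-1; tail: geometric series in r.
    mean = sum(surv) + g * r / (1.0 - r)
    second = 2.0 * sum(n * s for n, s in enumerate(surv))
    second += 2.0 * g * r * ((m + L) * (1.0 - r) + r) / (1.0 - r) ** 2
    var = second + mean * (1.0 - mean)
    return mean, sqrt(var)


if __name__ == "__main__":
    m, k, p, p1 = 10, 3, 0.10, 0.20
    print("beta_3 at n = 20:", bonferroni_lower(20, 3, k, m, p, p1))
    print("E_3, SD_3       :", mean_sd_product_type(3, k, m, p, p1))
```

Because the lower bound is linear in n with a negative slope, it eventually becomes negative, which is the difficulty noted in the Remark.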

3. Numerical Examples. We now evaluate for
selected values of m, k, $p$, $p_1$ and n the bounds
and the approximations for $P(M > n)$, $E(M)$ and
$SD(M) = \sqrt{\mathrm{Var}(M)}$, that have been derived in
Section 2. These results are compared in Tables
I-II with the approximations that have been
derived in Glaz (1983).

For the simulated values of $P(M > n)$, $E(M)$ and
$SD(M)$ (denoted in the tables below by SIM)
10,000 replicates of $N \approx 12 \times SD_3(M)$ pseudo-random
numbers uniformly distributed on the
interval (0,1) were generated, using IMSL
routine GGUBS. Each of the uniform pseudo-random
numbers was converted to an observation
from a desired 0-1 Markov process. The reason
for having to generate a sequence of Markov
trials of length $N \approx 12 \times SD_3(M)$ is that the
distribution of M has a very heavy right tail. The
quantity $N \approx 12 \times SD_3(M)$ has been adopted after
some numerical experimentation with evaluating
$E(M)$ via a simulation.

TABLE I

APPROXIMATIONS FOR THE PROBABILITY OF THE WAITING TIME FOR
DETECTION: TWO-STATE MARKOV CHAIN

 m    k    p     p_1    n    LB $\pi_1$   LB $\beta_3$   $\pi_3$    SIM

10    3   .05    .10    20     .9148       .9474       .9479    .9503
                        50     .5467       .8405       .8497    .8549
                 .50    20     .8542       .8641       .8661    .8642
                        50     .5985       .6482       .6858    .6819
          .10    .20    20     .6740       .7562       .7652    .7719
                        50     .1407       .2992       .4567    .4715
                 .80    20     .7270       .7266       .7324    .7373
                        50     .4543       .3725       .4773    .4745

10    5   .05    .10    20     .9972       .9989       .9989    .9982
                        50     .9897       .9962       .9962    .9974
                 .50    20     .9558       .9645       .9647    .9634
                        50     .8114       .8997       .9031    .9010
          .10    .20    20     .9558       .9787       .9788    .9780
                        50     .8114       .9313       .9331    .9341
                 .80    20     .8248       .8321       .8346    .8368
                        50     .5855       .5872       .6361    .6341

25    3   .05    .10    35     .7416       .7539       .7679    .7700
                        50     .5467       .6146       .6441    .6603
                 .50    35     .7244       .7299       .7328    .7312
                        50     .5985       .6080       .6291    .6268
          .10    .20    35     .3362       .3430       .3691    .3755
                        50     .1407       .0368       .2040    .2262
                 .80    35     .5767       .5756       .5806    .5680
                        50     .4543       .4287       .4646    .4567

25    5   .05    .10    35     .9632       .9746       .9747    .9766
                        50     .8858       .9532       .9538    .9593
                 .50    35     .8915       .8993       .8998    .9038
                        50     .8114       .8435       .8475    .8547
          .10    .20    35     .7184       .7689       .7737    .7847
                        50     .449?       .6040       .6390    .6687
                 .80    35     .6991       .7042       .7069    .7065

TABLE II

APPROXIMATIONS FOR THE EXPECTED VALUE AND STANDARD DEVIATION OF THE WAITING
TIME FOR DETECTION: TWO-STATE MARKOV CHAIN

  p     p_1    LB E_1(M)    E_3(M)     SIM E(M)    SD_3(M)    SIM SD(M)

 .05    .10       68.46      280.16      299.10      274.36      294.37
        .25       65.06      184.80      188.11      180.65      185.03
        .50       77.10      130.56      130.19      128.53      128.91
 .10    .20       31.12       63.06       66.06       58.17       62.91
        .50       37.20       55.59       56.27       53.26       53.77
        .80       61.50       68.88       68.46       69.94       69.61

 .05    .10     3987.28    11232.33    11381.59    11224.67    11463.61
        .25      458.40     1894.20     1921.95     1887.91     1932.93
        .50      117.10      458.80      464.60      454.64      473.99
 .10    .20      194.30      634.17      654.03      627.10      656.02
        .50       57.20      176.54      175.87      172.03      172.50
        .80       81.50      111.07      109.93      110.35      108.88

 .05    .10       61.06       98.08      108.88       85.48       98.19
        .25       65.06       97.96      104.29       88.36       95.93
        .50       77.10      103.12      105.93       98.47      102.34
 .10    .20       31.12       35.11       36.71       25.81       28.82
        .50       37.20       42.44       42.95       37.40       38.53
        .80       61.50       66.36       64.64       67.13       65.82

 .05    .10      114.78      708.03      826.66      690.23      791.41
        .25      105.06      413.94      464.56      399.21      456.29
        .50      117.10      259.18      272.58      250.21      267.06
 .10    .20       51.12       93.70      109.45       78.56       97.14
        .50       57.20       87.16       93.99       77.41       85.57
        .80       81.50      100.18      100.04       98.80       98.57
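
The simulation procedure described in this section can be sketched as follows. This is not the IMSL-based program used for the tables: Python's `random` module stands in for routine GGUBS, the function names and the seed are hypothetical, and each uniform number is converted to a Markov observation by comparing it with the appropriate transition probability.

```python
import random
from typing import List, Optional


def simulate_waiting_time(k: int, m: int, p: float, p1: float,
                          max_len: int, rng: random.Random) -> int:
    """Simulate one waiting time M; returns max_len + 1 if no pulse occurs."""
    p0 = p * (1.0 - p1) / (1.0 - p)  # from stationarity
    window: List[int] = []
    prev: Optional[int] = None
    for n in range(1, max_len + 1):
        # Success probability for this trial given the previous one.
        q = p if prev is None else (p1 if prev == 1 else p0)
        bit = 1 if rng.random() < q else 0
        window.append(bit)
        if len(window) > m:
            window.pop(0)  # keep only the last m trials
        if sum(window) >= k:
            return n
        prev = bit
    return max_len + 1


def estimate_survival(n: int, k: int, m: int, p: float, p1: float,
                      reps: int = 10_000, seed: int = 0) -> float:
    """Monte Carlo estimate of P(M > n) from `reps` replicates of length n."""
    rng = random.Random(seed)
    exceed = sum(
        simulate_waiting_time(k, m, p, p1, n, rng) > n for _ in range(reps)
    )
    return exceed / reps


if __name__ == "__main__":
    print(estimate_survival(n=50, k=3, m=25, p=0.10, p1=0.20))
```

For estimating $E(M)$ and $SD(M)$ rather than $P(M > n)$, the replicate length `max_len` would have to be long enough for almost every replicate to end in a detection, which is the role of the sequence length N in the text.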

From the numerical results in Tables I-II, we
can conclude that the new product-type approximations
for $P(M > n)$ and $E(M)$, given by $\pi_3$ and
$E_3(M)$, respectively, significantly improve the
approximations $\pi_1$ and $E_1(M)$ for these quantities
that have been studied in Glaz (1983). The new
approximations for $SD(M)$, $SD_3(M)$, are also much
more accurate than the approximation $SD_1(M)$.
Moreover, the product-type approximation $\pi_3$ is
more accurate than the Bonferroni-type lower
bound $\beta_3$. In some cases the improvement of $\pi_3$
over $\beta_3$ is remarkable. For example, if m = 25,
k = 3, $p$ = .10, $p_1$ = .20 and n = 50, then $\beta_3$
= .0368, $\pi_3$ = .2040 and the simulated value of
$P(M > 50)$ = .2262. Another deficiency of the
Bonferroni-type lower bounds for $P(M > n)$ is that
for $n \ge n_0$, it has a negative value. For this
reason we have not evaluated the Bonferroni-type
lower bound for $E(M)$ and the related approximation
for $SD(M)$.

Although the new approximations provide us
with quite accurate results in most cases, there
is still room for improvement. For example, if
m = 25, k = 5, $p$ = .05 and $p_1$ = .10, the simulated
values for $E(M)$ and $SD(M)$ are 826.66 and
791.41, respectively, while $E_3(M)$ = 708.03 and
$SD_3(M)$ = 690.23. This amounts to a relative
error of 14% in approximating $E(M)$ and 13% in
approximating $SD(M)$. To improve these approximations
one can try to evaluate the approximations
$E_L(M)$ and $SD_L(M)$ for L > 3. We will report
these results in a subsequent article.

*Regrettably, Professor Johnson died on November

Abstract: This research has been motivated by the
need to study meteorological radar signals. The
power received by a meteorological radar is the
energy backscattered from an ensemble of
meteorological targets. The time variation of
this power can be modelled as a time series with
exponential marginal distribution. Moreover the
signals are observed at two polarization states of
the transmitted wave and are correlated. This
paper deals with the inference problem associated
with the above described radar signals. We
discuss two different schemes, one based on second
order moments and the other using the distribution
functions. The simulation study of these two
schemes shows that they have similar performance
and hence the simpler moment technique can be used
with real time radar applications.

1. Introduction

The time series under consideration in this
paper is collected by a meteorological radar which
receives the signals backscattered from an
ensemble of hydrometeors (particles like
raindrops, hail, ice, etc.). These particles also
have a size distribution and orientation
distribution associated with them. Thus we have
an ensemble of particles that are randomly
positioned, randomly distributed in size, shape and
orientation and move randomly. Fluctuation of the
received power is related to all the above
distributions. The marginal distribution of the
received power is exponential in nature. The
medium when observed at different polarizations
gives different mean powers due to the anisotropy
of the medium, but they still are correlated since
the observations come from the same set of
targets. Statistical properties of these dual
polarized signals have been studied by Bringi et
al. (1983). Simultaneous observation of the
targets at two polarizations is difficult to
achieve technologically, and as a result the
observation is made at two polarizations, switching
between them very fast. This creates an inherent
property of missing observations.

Thus we have a class of multivariate
exponential time series describing the
backscattered power received by a meteorological
radar. This paper deals with the inference problem
associated with this exponential time series and
the paper is organized as follows: Section 2
deals with the description of the exponential
series in terms of complex Gaussian time series.
Section 3 analyzes the one step predictors, namely
the moment method and conditional expectation. In Section
4 we obtain the corresponding two step predictors
and Section 5 presents the conclusions with a
summary of key results of this paper.


2. The Exponential Time Series

Let $\{x_t,\ t \in (0, \pm 1, \ldots)\}$ be
the m-variate zero mean complex-valued stationary
Gaussian series. If we restrict our attention to
the case when $\{U_t\}$ and $\{V_t\}$ are uncorrelated
sequences, then we can construct the exponential
series $\{P_t\}$ as

$$P_{jt} = U_{jt}^2 + V_{jt}^2,$$

where $U_{jt}$ and $V_{jt}$ are the jth components of the
real and imaginary parts, $U_t$, $V_t$, of the multivariate
Gaussian series $x_t$. Properties of this series
and the relationship with the complex Gaussian
series have been studied by Chandrasekar et al.
(1987), and we refer to that paper for details.


In this paper we deal with constructing the
likelihood functions for the exponential series.
Let $x$ be an n-dimensional complex vector with
mean vector $c$ and positive definite covariance
matrix $R$. That is

$$E(x) = c \quad\text{and}\quad E(x - c)(x - c)' = R, \tag{1}$$

where $(x - c)'$ indicates the transpose of complex
conjugate. The quadratic form $(x - c)' R^{-1} (x - c)$
is real. Then we can write the density function
$f(x)$ as

$$f(x) = \frac{1}{\pi^n \det R}\,\exp\left[-(x-c)'\, R^{-1}\, (x-c)\right] \tag{2}$$

where $f(x)$ is a real-valued scalar function of
the complex vector $x$, see Miller (1974). From
the above density function we can get the density
of $P$ by integrating over the phases of the full
complex vector.

The above model of the exponential series
obtained from complex Gaussian fits the radar data
very well. The power received by the radar is the
sum of the squares of the in-phase and quadrature
components of the received signal, which behaves as
a complex Gaussian process.

The spectra of radar signals are
approximately Gaussian in nature. This implies
that the autocorrelation function is also
Gaussian. These autocorrelation functions can be
written in terms of a spectrum width $\sigma_v$,
sampling time $T_s$ of the radar and wavelength $\lambda$ of
the signals. The autocorrelation function of
the complex signal at lag m, ($\rho(m)$), can be written
in terms of the above parameters as

$$\rho(m) = \exp\left[-8\left(\frac{\pi \sigma_v m T_s}{\lambda}\right)^2\right]\exp\left(-j\,\frac{4\pi \bar{v}\, m T_s}{\lambda}\right), \tag{3}$$

where $\bar{v}$ is the mean velocity, which enters only through the phase.
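
As a small numerical illustration, the magnitude of the autocorrelation can be evaluated for the spectrum widths considered later in the paper. The sketch assumes the Gaussian-spectrum form $|\rho(m)| = \exp[-8(\pi \sigma_v m T_s/\lambda)^2]$ and that the spectrum width is expressed in metres per second. The function name and the unit convention are assumptions made here.

```python
import math


def autocorr_magnitude(lag: int, sigma_v: float, t_s: float, wavelength: float) -> float:
    """|rho(lag)| for a Gaussian spectrum (assumed form, SI units)."""
    return math.exp(-8.0 * (math.pi * sigma_v * lag * t_s / wavelength) ** 2)


if __name__ == "__main__":
    wavelength = 0.10  # 10 cm
    t_s = 1.0e-3       # 1 millisecond
    for sigma_v in range(1, 7):
        rho = autocorr_magnitude(1, float(sigma_v), t_s, wavelength)
        # The squared magnitude is the quantity used for the power series.
        print(sigma_v, round(rho, 5), round(rho ** 2, 5))
```

With these parameters the lag-1 magnitude is close to unity for narrow spectra and decreases as the spectrum width grows.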


It can be shown that $\rho_p(m)$, the autocorrelation
function at lag 'm' of the power signals, is
related to $\rho(m)$ as

$$\rho_p(m) = |\rho(m)|^2. \tag{4}$$

3. One Step Predictor

Let $r_t$ be the amplitude vector that corresponds
to the power vector $P_t$, where $P_{jt} = r_{jt}^2$. We can
either predict in amplitude domain or power
domain. The amplitude time series can easily be
constructed as the term by term square root of the
power series. We consider only univariate
stationary time series for the sake of inference
analysis and these results can be extended easily
to multivariate cases with the introduction of
appropriate cross correlation functions. We
consider two predictors here for comparison, based
on the second order moments and the density
function as follows:

Predictor I: This predictor based on inner
products is constructed as follows (Brockwell and
Davis, 1987):

$$\hat{P}_{i+1} = a_0 + a_1 P_i \tag{5}$$

where $a_0$ and $a_1$ are evaluated based on the
criteria that $\hat{P}_{i+1}$ is a projection of $P_{i+1}$ on the
space $P_i$ with the constraint that $E(\hat{P}_{i+1}) = E(P_{i+1})$
to obtain an unbiased estimator. Under
this condition we get $\hat{P}_{i+1}$ as

$$\hat{P}_{i+1} = (1 - |\rho|^2) + |\rho|^2 P_i \tag{6}$$

where $|\rho|^2$ is the lag 1 auto-correlation of the
power time series, see Chandrasekar et al. (1987).
The above result is valid for unit mean and
variance and for realistic signals can be scaled
accordingly.

Predictor II: This predictor is obtained by
taking the conditional expectation of $P_{i+1}$
conditioned on $P_i$:

$$\hat{P}_{i+1} = E(P_{i+1} \mid P_i). \tag{7}$$

Computation of the above result requires knowledge
of the joint density function or the conditional
expectation and is obtained as follows: Let

$$X_1 = r_1 \exp(i\theta_1) \quad\text{and}\quad X_2 = r_2 \exp(i\theta_2) \tag{8}$$

where $r_j$, $\theta_j$ are defined as $r_j = |X_j|$ and $\theta_j$
the corresponding argument of $X_j$. This
transformation enables us to work in the domain of
amplitude and phase of the complex signal $X$. Let
$S$ be the inverse of the covariance matrix with
its terms defined as

$$S = \begin{pmatrix} s_{11} & s_{12} \\ \bar{s}_{12} & s_{22} \end{pmatrix} \tag{9}$$

where $s_{11}$, $s_{22}$ are positive real and $s_{12}$ is
complex. The complex density function now becomes

$$f'(r_1, r_2, \theta_1, \theta_2) = \frac{r_1 r_2 \det S}{\pi^2}\,\exp\left\{-\left[s_{11} r_1^2 + s_{22} r_2^2 + 2 r_1 r_2\,\mathrm{Re}\left(s_{12}\, e^{i(\theta_2 - \theta_1)}\right)\right]\right\}. \tag{10}$$

We can obtain the joint distribution of amplitudes
by integrating over the phases as

$$f(r_1, r_2) = 4\, r_1 r_2 (\det S)\,\exp\left[-(s_{11} r_1^2 + s_{22} r_2^2)\right] I_0(2 |s_{12}| r_1 r_2), \quad r_1, r_2 > 0, \tag{11}$$

and $f(r_1, r_2) = 0$ elsewhere, where $I_0$ is the
modified Bessel function of the
first kind and order zero, see Miller (1974).
Similarly it can be shown that the marginal
distribution $g(r)$ is given by

$$g(r) = \frac{2r}{\bar{p}}\,\exp\left(-\frac{r^2}{\bar{p}}\right) \tag{12}$$

where $\bar{p}$ is the mean power of the signal. Using
equations (11) and (12) we can write the
conditional distribution $g(r_2 \mid r_1)$ as

$$g(r_2 \mid r_1) = 2\,\bar{p}\, r_2 \det(S)\,\exp\left\{-\left[\left(s_{11} - \frac{1}{\bar{p}}\right) r_1^2 + s_{22} r_2^2\right]\right\} I_0(2 |s_{12}| r_1 r_2). \tag{13}$$

Integrating equation (13) over the range of $r_2$ (0
to $\infty$) we can obtain predictor 2.

The conditional density function for radar
signals can be obtained from the autocorrelation
description of Section 2 and (13) as

$$g(r_2 \mid r_1) = \frac{2 r_2}{1 - |\rho|^2}\,\exp\left[-\frac{|\rho|^2 r_1^2 + r_2^2}{1 - |\rho|^2}\right] I_0\left(\frac{2 |\rho|\, r_1 r_2}{1 - |\rho|^2}\right). \tag{14}$$

The conditional density given by (13) can be
integrated multiplying by $r_2$ to get the value of
predictor II.


Example 1: The conditional density $g(r_2 \mid r_1)$ is
integrated over the entire range of $r_2$ (0 to $\infty$) to
compute the conditional expectation and is
calculated numerically. It would be useful to
make some comments on this numerical computation.
The typical measurements made by radars give a
value of $|\rho|^2$ very close to unity and the term
$1 - |\rho|^2$ has to be handled carefully when it
appears in denominators. Next, the values of $I_0$
for large arguments increase exponentially and
they have to be cancelled explicitly with other
terms to make the expectation stable.

We  use  typical  radar  parameters  to  compute 
the  predictor  II  and  they  are  as  follows 


    $\hat{P}_{i+2} = b_0 + b_1 P_i + b_2 P_{i+1}$

where $b_0$, $b_1$ and $b_2$ are evaluated based on the
criterion that $\hat{P}_{i+2}$ is a projection of $P_{i+2}$ on
the space containing $P_i$ and $P_{i+1}$, with the
constraint that $E(\hat{P}_{i+2}) = E(P_{i+2})$ to obtain
an unbiased estimator. Under these conditions we
get $b_j$, $j = 0, 1, 2$ as

    $b_0 = \frac{1 - |\rho(2)|^2}{1 + |\rho(1)|^2}$


$\lambda$ = 10 cm (microwave radar at S band),

$T_s$ = 1 millisecond,

Mean power of signal is unity.


    $b_1 = \frac{|\rho(2)|^2 - |\rho(1)|^4}{1 - |\rho(1)|^4}$    (16)


Figure 1a shows the one step predictor $\hat{P}_{i+1}$
(predictor II) as a function of the signal $P_i$
with the spectrum width as parameter. The
various curve markings x, +, A, ..., O indicate
values of $\sigma_v$ = 1, 2, ..., 6 respectively. We can
see that for large values of spectrum width the
predictor is nearly unperturbed by the value of
$P_i$, but for narrow spectra the predictor increases
nearly linearly with $P_i$. This suggests
that a linear predictor may do almost as well as
predictor 2. Figure 1b shows predictor 1 for
radar parameters identical to those used in
predictor 2. This is a linear predictor and the
results are only slightly higher for all values of
$P_i$ and all spectrum widths. The above phenomena
can be easily explained based on the shape of the
distributions. Predictor 2 uses the information
on the distribution of the signals, which falls
exponentially with signal power, and hence weighs
it lower than predictor 1, which does not make
use of the shape of the distribution. However,
the difference between the two estimators is
relatively small.

Several  simulated  time  series  were  used  to 
test  these  two  predictors.  Mean  square  errors 
were  calculated  on  the  difference  between  the 


    $b_2 = \frac{|\rho(1)|^2 \, [1 - |\rho(2)|^2]}{1 - |\rho(1)|^4}$

The above result is valid for unit mean of
the power signal; for non-unit means the
predictor can be scaled accordingly.
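The moment-based two step predictor can be checked with a few lines of code. The sketch below is not taken from the paper: it assumes a unit-mean, unit-variance power series whose lag-k correlation enters as $|\rho(k)|^2$ (as the form of the coefficients suggests), solves the two normal equations of the projection, and fixes $b_0$ by the unbiasedness constraint. The function names are hypothetical.

```python
def two_step_coefficients(rho1_abs, rho2_abs):
    """Return (b0, b1, b2) for P_hat[i+2] = b0 + b1*P[i] + b2*P[i+1].

    Assumption (not stated explicitly in the text): the power series has
    unit mean and unit variance, and its lag-k correlation is |rho(k)|**2.
    """
    c1 = rho1_abs ** 2  # assumed power correlation at lag 1
    c2 = rho2_abs ** 2  # assumed power correlation at lag 2
    # Normal equations for projecting P[i+2] on (P[i], P[i+1]):
    #   b1 * 1  + b2 * c1 = c2
    #   b1 * c1 + b2 * 1  = c1
    det = 1.0 - c1 * c1          # breaks down when |rho(1)| = 1
    b1 = (c2 - c1 * c1) / det
    b2 = c1 * (1.0 - c2) / det
    b0 = 1.0 - b1 - b2           # unbiasedness with unit mean
    return b0, b1, b2


def predict_two_step(p_i, p_i1, rho1_abs, rho2_abs):
    """Linear two step prediction from the two latest power samples."""
    b0, b1, b2 = two_step_coefficients(rho1_abs, rho2_abs)
    return b0 + b1 * p_i + b2 * p_i1
```

Under these assumptions the intercept simplifies to $(1 - |\rho(2)|^2)/(1 + |\rho(1)|^2)$, and an input at the mean power (both samples equal to 1) is predicted as 1.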

Predictor II: This predictor is obtained
similarly to the one step predictor as

    $\hat{P}_{i+2} = E(P_{i+2} \mid P_i, P_{i+1})$.

Let $X_j = r_j \exp(i\theta_j)$, $j = 1, 2, 3$.

Similar  to  those  discussed  in  Section  3 


11 

^12 

^13 

12 

^22 

$23 

13 

^23 

^33 

(17) 

(18) 


Then  we  can  write 
2ir 


firj^.r^.r^)  - 


det  s 
4 


X'  S  X  d9j  dS^  <193  (19) 


known signal $P_{i+1}$ and the predicted value $\hat{P}_{i+1}$, applying the
predictors to simulated series; see Chandrasekar
et al. (1987). The difference between the mean
square errors for the two predictors was small,
leading us to conclude that the two one step
predictors perform similarly.

4. Two Step Predictor:

The two step predictor in principle is an
extension of the one step predictor but gets
complicated quickly. We again consider two
predictors here, similar to section 3, based on
second order moments and the density function.

Predictor I: This predictor is based on
inner products and is constructed as follows:


Equation  (19)  can  be  reduced  after  some  algebraic 
manipulations  as,  (Miller,  1974) 


f(rj.r2,r^)  -  8(dets)  t^r^r^ 


exp 


[-(r'lSii  + 


•  }  'm(2V2>Sl2l>y2V3|S23l)  . 

m-0 

I|j(2r3ri  •  [Sj^Dcos  m  *  ^23  *  ^31^ 

where the angles in the cosine term are the phases of
the corresponding off-diagonal elements of S,
$\epsilon_m$ = 1 for m = 0 and 2 for
m = 1, 2, ..., and $I_m$ is the modified Bessel
function of the first kind and order m.
We can write predictor 2 as


r 


i+2 


El 


f(rpr^)  ' 


(21) 


Equations (20) and (21) indicate the
complexity involved in computation of the two step
predictor. The integrand in computing the expectation is an
infinite summation containing terms with modified
Bessel functions. It is simpler numerically to
compute the three dimensional integration in Eq.
(19) and then use it in (21) to compute the
expectation. This integration of the complex
density function turns out to be computationally
intensive and more involved than the one step
predictor.

Example 2: This example is constructed for
the same radar parameters as in example 1. Figure
2 shows the two step predictor $\hat{P}_{i+2}$ as a function
of $P_{i+1}$ for a mean power of unity. The
continuous curve shows the results of predictor 2
whereas the points indicate predictor 1. We can
see from Fig. 2 that, though there is some
difference between the two predictors for small
values of $P_{i+1}$, the overall agreement seems good
between the two predictors. We cannot make a
conclusive statement based on this example;
however, we see the same trend as in example 1, i.e.
the two predictors perform similarly. This
observation is important in the context of the
computational complexity involved in obtaining
predictor 2.


Conclusions and Discussion:

We have discussed some inference techniques
for a class of exponential time series in the
context of radar signals. The exponential time
series is constructed from a complex Gaussian time
series with arbitrary correlation structure.

Two predictors have been considered for
analysis, one based on second order moments
(predictor I) and the second based on conditional
expectation (predictor II). These two predictors
have been derived for one step and two step
prediction. The complexity involved in
higher order versions of predictor II is
exhibited clearly by a comparison of the
corresponding one step and two step predictors.
The moment method predictors are computationally
simple compared to predictor II. The
computational complexity of the two step predictor II
is an order of magnitude more than that of the one
step predictor II. Real time applications of
predictor II will be possible only through a
pre-calculated look up table system, since these
predictors are computationally intensive. The smooth variation
of these predictors as observed in examples 1 and
2 indicates that we may not need too many entries
in the look up table, and intermediate values can possibly be
interpolated.
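The look up table idea can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the table is built once on a uniform grid of input powers, values between grid points are linearly interpolated, and inputs outside the grid are clamped to the end entries. The names `build_table`, `lookup` and the stand-in predictor are hypothetical; in practice the tabulated function would be the numerically integrated predictor II.

```python
from bisect import bisect_right


def build_table(predictor, p_max, n_entries):
    """Tabulate predictor(p) on a uniform grid over [0, p_max]."""
    step = p_max / (n_entries - 1)
    grid = [k * step for k in range(n_entries)]
    values = [predictor(p) for p in grid]
    return grid, values


def lookup(grid, values, p):
    """Linearly interpolate the tabulated predictor at power p.

    Inputs outside the tabulated range are clamped to the end entries.
    """
    if p <= grid[0]:
        return values[0]
    if p >= grid[-1]:
        return values[-1]
    k = bisect_right(grid, p) - 1           # grid[k] <= p < grid[k + 1]
    w = (p - grid[k]) / (grid[k + 1] - grid[k])
    return (1.0 - w) * values[k] + w * values[k + 1]


def placeholder_predictor(p):
    """Stand-in for the expensive predictor II (purely illustrative)."""
    return 0.2 + 0.8 * p
```

A smooth predictor needs few entries for a given interpolation error, which is the point made in the text; the grid size would be chosen from the observed curvature of the predictor curves.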

We have done a mean square error
evaluation of these two predictors for one step
prediction based on simulation, and they both give
nearly equal mean square errors. The example for
the two step predictor shows that, over a wide range,
on the average the two predictors give similar
values. In real time applications in a radar,
simplicity of computation is as important as
accuracy as long as we can keep making real-time
updates of observational data. Thus, based on the
above observations and radar system constraints,
our initial suggestion is that the moment based
predictor I is well suited for radar applications.
The time series studied here is reversible and
hence all the results discussed here can easily be
extended to other inference problems.
ALTERNATIVE METHODS FOR COMPUTING THE THEORETICAL AUTOCOVARIANCE FUNCTION
OF MULTIVARIATE ARMA PROCESSES: A COMPARISON
Stefan Mittnik, SUNY at Stony Brook*


Abstract 

Matricial expressions relating the theoretical
autocovariances of multivariate autoregressive
moving average processes to the parameters of the
process are employed to derive a framework which
unifies alternative algorithms for computing
theoretical autocovariance functions.

1. INTRODUCTION

The theoretical autocovariance function of
autoregressive moving average (ARMA) processes
has to be computed frequently in applied time
series analysis. For example, exact maximum
likelihood estimation procedures for ARMA models
require the derivation of theoretical
autocovariances in each iteration of the maximization
algorithm (see, for example, Nicholls and Hall,
1979; Gardner et al., 1980; Shea, 1987; Mittnik,
1988a). Theoretical autocovariances are also
needed in distributional analyses of estimated
ARMA parameters (Hannan, 1970) and for correctly
initializing simulations with ARMA models (Ansley
and Newbold, 1980; Woodfield, 1988).

The problem of computing the theoretical
autocovariance function amounts to specifying and
solving a system of linear equations. Algorithms
for univariate ARMA processes have been suggested
in McLeod (1975, 1977), Akaike (1978) and
Tunnicliffe Wilson (1979). In the multivariate
case the specification of the system's
coefficient matrix represents a major difficulty.
Nicholls and Hall (1979) present an algorithm
for the multivariate processes. The algorithm
given in Pate and Davies (1988) is essentially
equivalent to the one of Nicholls and Hall.
Ansley (1980) and Kohn and Ansley (1982) propose
slightly more efficient algorithms by eliminating
some of the unknowns in the equation system. By
developing a closed form matricial relationship,
which expresses theoretical autocovariances of
multivariate ARMA processes in terms of the ARMA
parameters, Mittnik (1988b) presents a procedure
which, compared to the previous approaches,
reduces the number of unknowns in the system by a
factor of about two.

* Department of Economics, SUNY at Stony Brook,
Stony Brook, NY 11794-4384.

In this paper we employ the matricial
expressions in Mittnik (1988b) to compare the
alternative approaches to computing theoretical
autocovariance functions of multivariate ARMA
processes. While the earlier algorithms require
rather complex indexing schemes to set up the
coefficient matrices (see, for example, Ansley,
1980; Kohn and Ansley, 1982; Pate and Davies,
1988), the results derived here yield closed form
expressions for the coefficient matrices of the
respective algorithms.

2.  ARMA  COEFFICIENTS  AND  AUTOCOVARIANCES 

Assume the stationary zero mean
non-deterministic time series $\{y_t\}$ is
generated by the ARMA(p,q) process

    $A(L) y_t = B(L) \epsilon_t$,    (2.1)

where $A(L)$ is a stable matrix polynomial in the
lag operator $L$ defined by
$A(L) = I - A_1 L - A_2 L^2 - \ldots - A_p L^p$,
and $B(L) = B_0 + B_1 L + \ldots + B_q L^q$. Process
$\{\epsilon_t\}$ is white noise, i.e. $E(\epsilon_t) = 0$
and $E(\epsilon_t \epsilon_s') = \delta_{ts} \Sigma$. Note
that without loss of generality either $B_0$ or $\Sigma$
can be assumed to be an identity matrix. Unless
stated otherwise we set $\Sigma = I$.

It is well known that given the initial
autocovariances $\Gamma_\tau = E(y_t y_{t-\tau}')$ ($\tau = 0, \ldots, p-1$) of an
ARMA(p,q) process, higher order autocovariances
can be determined recursively by applying the
modified Yule-Walker equations. From the
definition of the autocovariance it follows that

    $\Gamma_\tau = A_1 \Gamma_{\tau-1} + \ldots + A_p \Gamma_{\tau-p}
        + E(B_0 \epsilon_t y_{t-\tau}' + \ldots + B_q \epsilon_{t-q} y_{t-\tau}')$
        ($\tau = 0, 1, \ldots$)    (2.2)
Replacing $y_{t-\tau}$ in (2.2) by its moving average
representation, $A^{-1}(L) B(L) \epsilon_{t-\tau} = C(L) \epsilon_{t-\tau}$,
where $C(L) = C_0 + C_1 L + \ldots$, and recalling the unit
variance assumption, we can write

    $E(\epsilon_{t-i} y_{t-\tau}') = C_{i-\tau}'$ if $i \ge \tau$, and 0 otherwise.


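Defining $\Gamma = (\Gamma_0' \; \Gamma_1' \; \ldots \; \Gamma_p')'$,
$\Gamma^* = (\Gamma_0 \; \Gamma_1 \; \ldots \; \Gamma_p)'$,
$C = (C_0' \; C_1' \; \ldots \; C_q')'$, $C^* = (C_0 \; C_1 \; \ldots \; C_q)'$ and using
the fact that $\Gamma_{\tau-i} = \Gamma_{i-\tau}'$ allows us to rewrite
(2.2) in matrix terms as (Mittnik, 1988b)

    $\Gamma = M_T \Gamma + M_H \Gamma^* + N C^*$    (2.3)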


where  the  m(p+l )xm(p+l ) 
defined  ais 


matrices 


Mjj  and  Kj. 


are 


- » 

o 

X 

1  1 

■  0  0 

ft 

o 

o 

II 

T  0 

Matrix  H  denotes  the  Hankel  matrix 


N  = 


"O  ^  Vl 

Bj  0 


BO  ...  0 

q 


m(p-q)xm(q+l )  ^ 


If  p2q. 


Note that if $\Sigma \ne I$, matrix $C^*$ is defined by
$C^* = (C_0 \Sigma \; \ldots \; C_q \Sigma)'$. Expression (2.3) relates the
theoretical autocovariances to the autoregressive
coefficients and the coefficients of the pure
moving average representation. A matricial
expression purely in terms of the ARMA parameters
can be found in Mittnik (1988b).
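As a point of reference for the algorithms compared in the next section, the relation between the moving average coefficients and the autocovariances can be checked numerically in the univariate case. The sketch below is not one of the algorithms discussed in the paper: it assumes a scalar ARMA(p,q) process with unit innovation variance, builds the coefficients of $C(L) = A^{-1}(L) B(L)$ recursively, and approximates the autocovariances by the truncated sums $\sum_k C_k C_{k+\tau}$. The function names and the truncation length are hypothetical choices.

```python
def ma_coefficients(a, b, n_terms):
    """First n_terms coefficients of C(L) = A(L)^{-1} B(L), scalar case.

    a = [A_1, ..., A_p] with A(L) = 1 - A_1 L - ... - A_p L^p,
    b = [B_0, ..., B_q] with B(L) = B_0 + B_1 L + ... + B_q L^q.
    """
    c = []
    for i in range(n_terms):
        ci = b[i] if i < len(b) else 0.0
        for j, aj in enumerate(a, start=1):
            if i - j >= 0:
                ci += aj * c[i - j]    # C_i = B_i + sum_j A_j C_{i-j}
        c.append(ci)
    return c


def autocovariances(a, b, max_lag, n_terms=500):
    """Approximate Gamma_0..Gamma_max_lag by truncating the MA(inf) sum.

    Assumes a stable A(L) and unit innovation variance; the truncation
    error decays geometrically with n_terms for a stable process.
    """
    c = ma_coefficients(a, b, n_terms + max_lag)
    return [sum(c[k] * c[k + tau] for k in range(n_terms))
            for tau in range(max_lag + 1)]
```

For the ARMA(1,1) process with $A_1 = 0.5$, $B_0 = 1$, $B_1 = 0.4$, the closed forms give $\Gamma_0 = 2.08$ and $\Gamma_1 = 1.44$, and for lags beyond q the result satisfies the recursion $\Gamma_\tau = A_1 \Gamma_{\tau-1}$. The exact algorithms discussed below avoid the truncation by solving a linear system instead.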


3.  COMPARISON  OF  ALGORITHMS 

Nicholls and Hall's (1979) procedure for
computing the initial p+1 theoretical
autocovariance matrices amounts to vectorizing the
transposition of (2.3), yielding

    $\gamma = (M_T \otimes I_m)\gamma + (M_H \otimes I_m) W_{p+1} \gamma + \delta$,    (3.1)


H  = 


0 


0 


and  T  is  the  lower  triangular  Toeplitz  matrix 


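where $\gamma = \mathrm{vec}(\Gamma')$, $\delta = \mathrm{vec}(C^{*\prime} N')$, and matrix
$W_{p+1} = I_{p+1} \otimes W$ is a 0-1 commutation matrix such that
$\mathrm{vec}(\Gamma^{*\prime}) = W_{p+1} \mathrm{vec}(\Gamma')$ and
$\mathrm{vec}(\Gamma_i') = W \mathrm{vec}(\Gamma_i)$. By defining

    $M_I = I - M_T \otimes I_m - (M_H \otimes I_m) W_{p+1}$,

we obtain the linear equation system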


A 


1 


T  = 


A 


P 


finally,  the  m(p+l )xm(q+l )  matrix  N  is  defined  by 


N  = 


B 

q 


B 

q 

0 


if  P<q; 


iB 

P 


p+1 


B  0  .  .  .  0 

q 


    $M_I \gamma = \delta$,    (3.2)

whose solution provides the elements of the
initial p+1 autocovariance matrices. The
approaches of Nicholls and Hall (1979) and Pate
and Davies (1988) correspond to solving the
$m^2(p+1)$-dimensional system (3.2).

Using the fact that $\Gamma_0$ is symmetric and
extending the univariate results of McLeod
(1975), Ansley (1980) reduces the size of the
system by eliminating the $m(m-1)/2$ redundant
elements in $\Gamma_0$. Letting $\tilde{\gamma}$ denote the vector
obtained by eliminating the redundant elements in
$\gamma$, we can define the 0-1 matrices $S_1$ and $S_2$ such
that $\tilde{\gamma} = S_1 \gamma$, $\tilde{\delta} = S_1 \delta$ and
$\gamma = S_2 \tilde{\gamma}$. The equation
system providing the solution for $\tilde{\gamma}$ is

    $S_1 M_I S_2 \tilde{\gamma} = \tilde{\delta}$    (3.3)

Expression (3.3), involving $m^2 p + (m+1)m/2$ unknowns,
is equivalent to Ansley's (1980) approach. The
coefficient matrix, which he constructs with a
rather complex indexing scheme, corresponds to
matrix $S_1 M_I S_2$ in (3.3).


Observing that

    $\Gamma_p = \sum_{i=0}^{p-1} A_{p-i} \Gamma_i + K_p$,    (3.4)

where

    $K_p = \sum_{i=p}^{q} B_i C_{i-p}'$ if $p \le q$, and $K_p = 0$ if $p > q$,    (3.5)

and that $\Gamma_p$ affects only $\Gamma_0$ in (2.3), Kohn and
Ansley (1982) eliminate $\Gamma_p$ from the equation
system. In our framework, this is accomplished
by substituting (3.4) for $\Gamma_p$ in the RHS of (2.3).
Defining  j'=vec(PQ  pj  ...  Pp.j'.  we  can  write 


i  =  (M.j.®nr  +  (Mj^snWpr  *  MpV^  +  5, 

where  ^  and  are  obtained  by  deleting  both  the 

last  block  row  and  last  block  column  of  Mj.  and 

M  ,  respectively, 

H 


3  =  vec(K  A^+K,I  k!  ...  k’  ,  ), 

p  p  0  1  p-1 


with  Kj  denoting  the  i^^  (block)  entry  in  NC  , 


M  =  (A  10  ,  ,  , )  0(A  A  ....  A, 1 

p  p  mxm(p-l)  p  p-1  1 


W  =I  ©W,  and  matrix  V  is  defined  such  that 
P  P 

vec;p,^..p’  ,  )^=Vvec(P,!.  .  .  P’  Let 

O  p-1  0  p-1 


M  =  I  -  PLj.01  +  Mj^ol  +  M 


and  define  j  as  the  vector  obtained  by  deleting 
the  redundant  elements  in  f.  Moreover,  define 
the  0-1  matrices  and  such  that  3'^=S^3r, 
6  and  .  The  theoretical  autoco- 
varlances  P„,  ...P  .  are  calculated  by  solving 

0  p- 1 


(3.6) 


a system with $m^2 p - m(m-1)/2$ unknowns. Computing
the autocovariances via (3.6) corresponds to the
algorithm in Kohn and Ansley (1982).

Making use of the particular structure of
(2.3), Mittnik (1988b) proposes a more efficient
algorithm. Partitioning $\Gamma$ such that
$\Gamma = (\Gamma^{1\prime} \; \Gamma^{2\prime})'$, with
$\Gamma^1 = (\Gamma_0' \ldots \Gamma_s')'$,
$\Gamma^2 = (\Gamma_{s+1}' \ldots \Gamma_p')'$,
where


=  {  A 


if  p  is  even 


2  ■ 


if  p  is  odd. 


enables  us  to  rewrite  (2.3)  as 


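    $\begin{pmatrix} \Gamma^1 \\ \Gamma^2 \end{pmatrix} =
     \begin{pmatrix} T_{11} & 0 \\ T_{21} & T_{22} \end{pmatrix}
     \begin{pmatrix} \Gamma^1 \\ \Gamma^2 \end{pmatrix} +
     \begin{pmatrix} H_{11} & H_{12} \\ H_{21} & 0 \end{pmatrix}
     \begin{pmatrix} \Gamma^{*1} \\ \Gamma^{*2} \end{pmatrix} +
     \begin{pmatrix} K^1 \\ K^2 \end{pmatrix}$    (3.7)

where matrices $T_{ij}$, $H_{ij}$ and $K^i$ denote
conformable submatrices of $M_T$, $M_H$, and $NC^*$,
respectively. By defining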


r  =vec(P^')=W  vec(P  ^^) 
'  „2T, 


y.=vec(P  )=W  vec(r 

<:  p-S 

Ti=r„el 

VT21®’ 

"3=^22®’ 


we  can  write 


5,=vec(K  ) 
62=vec(K‘^) 

V'"21®'>Vs 

Vf"i2®'>"s.r 


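    $(I - T_1 - H_1)\gamma_1 = H_3 \gamma_2 + \delta_1$    (3.8)

    $(I - T_3)\gamma_2 = (T_2 + H_2)\gamma_1 + \delta_2$.    (3.9)

Substituting

    $\gamma_2 = (I - T_3)^{-1} [(T_2 + H_2)\gamma_1 + \delta_2]$    (3.10)

into (3.8) and eliminating again the redundant
elements in $\gamma_1$ by defining $S_5$ and $S_6$ such that
$\tilde{\gamma}_1 = S_5 \gamma_1$ and $\gamma_1 = S_6 \tilde{\gamma}_1$
gives

    $S_5 M_{III} S_6 \tilde{\gamma}_1 = S_5 [\delta_1 + H_3 (I - T_3)^{-1} \delta_2]$    (3.11)

where

    $M_{III} = I - T_1 - H_1 - H_3 (I - T_3)^{-1} (T_2 + H_2)$.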


Once $\gamma_1$ has been computed, $\gamma_2$ can be obtained
either recursively (Mittnik, 1988b) or from
(3.10). The number of unknowns in system (3.12)
is $(m^2 p + m)/2$ if p is odd and $(m^2 (p-1) + m)/2$ if p
is even. Thus, for large values of p the number
of unknowns in (3.12) is substantially less than
for any of the other methods.
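The system sizes quoted for the four approaches can be tabulated side by side. The sketch below simply encodes the counts as stated in the text, for m variables and autoregressive order p; the function name and dictionary keys are hypothetical.

```python
def unknowns(m, p):
    """Number of unknowns in each linear system, as quoted in the text."""
    redundant = m * (m - 1) // 2          # duplicate elements of symmetric Gamma_0
    if p % 2 == 1:
        mittnik = (m * m * p + m) // 2    # p odd
    else:
        mittnik = (m * m * (p - 1) + m) // 2  # p even
    return {
        "nicholls_hall": m * m * (p + 1),             # system (3.2)
        "ansley": m * m * p + (m + 1) * m // 2,       # system (3.3)
        "kohn_ansley": m * m * p - redundant,         # system (3.6)
        "mittnik": mittnik,
    }
```

For example, with m = 2 and p = 3 the counts are 16, 15, 11 and 7, which illustrates the reduction by a factor of about two relative to the earlier methods.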

The ratio of the elementary multiplications of
the (previously most efficient) algorithm by Kohn
and Ansley (1982) over the ones required in
Mittnik (1988b), reported in Table 1, indicates
the computational savings of the latter approach.
Note that the fact that the construction of the
coefficient matrices in (3.11) is more complex
has to be taken into consideration. Using,
however, Akaike's (1973) block-Levinson
algorithm, the inversion of $I - T_3$ requires only
$O(p^2)$ operations.

Table 1: Comparison of Computational Complexity

m    p    Ratio
1    2     7.0
1    3     4.2
1    4     6.0
1    6     6.2
2    2     9.5
2    3     3.5
2    4     8.4
2    6     8.2
3    2    12.7
3    3     3.8
3    4     9.7
3    6     9.0
4    2    15.2
4    3     4.1
4    4    10.5
4    6     9.5
5    2    17.0
5    3     4.2
5    4    11.0
5    6     9.8

m: number of variables
p: autoregressive order
Ratio: ratio of elementary multiplications,
Kohn and Ansley (1982) / Mittnik (1988b)


4. CONCLUSIONS

Making use of a matricial expression relating
the autocovariances of an ARMA process to the
ARMA parameters, a general framework for
alternative approaches to computing the
theoretical autocovariance function of
multivariate ARMA models has been provided. The
results facilitate the implementation of these
algorithms by deriving closed form expressions
for their respective coefficient matrices instead
of using complex indexing schemes.
XVI. RELIABILITY AND LIFE DISTRIBUTIONS

INCREASING RELIABILITY OF MULTIVERSION FAULT-TOLERANT SOFTWARE DESIGN BY MODULATION


Junryo  Miyashita, California  State  University  at  San  Bernardino 


Abstract: 

One of the problems of
multi-version fault-tolerant software is
the high cost of development. This paper
addresses that problem. Rather than
working on the common requirement
specification for a whole program, teams
of programmers will work on the common
specifications for each module in each
version of a program. One version of a
program consists of a set of modules.
This will enable the modules in each
version to be interchangeable. The
effects of such a modularization scheme
on reliability are studied.
Theoretical reliabilities of modularized
N-version Programming and Recovery Block
are derived in closed form, assuming
independence among different modules.
Numerical results show a substantial
increase in reliability in both the
N-version Programming and Recovery Block
schemes.

Section One : Introduction.

Fault-tolerance in software is
achieved by introducing redundancy. There
are three well known designs:
N-version Programming, Recovery Block,
and Consensus Recovery Block. All three
designs effect reliability by using
multiple implementations of a common
requirements specification. One problem
with multiversions is the high
development cost. This paper addresses
the problem.

Reliability is increased without an
increase in costs by delaying the
introduction of redundancy until later
in the development cycle. Redundancy is
introduced after completion of a modular
design which provides a common set of
module specifications. Multiple versions
of each module are then implemented, as
opposed to multiple versions of an
entire program. Each version of a module
is interchangeable with every other
version. Multiversions of a program are
built by assembling different versions of
the requisite modules. Theoretical
studies provide further confidence in
the efficacy of this approach.
Theoretical reliabilities of modularized
multiversion fault-tolerant software
are derived in closed form, assuming
independence among modules. The
effects of modularization on reliability
for the multiversion designs are
calculated. The results show substantial
increases in reliability for each.

Section Two : Effect of modularization
of N-version programming.
Part One: General Formula.

We shall assume that we have N
independent versions of a software and
in each version, there are M independent
modules.

In N-version programming, we shall
assume that we have a correct output
when two of the outputs agree. Now the
effect of modularization is that as long
as any two versions of a given module
are correct, the outcome of
that module is assumed to be correct.
In other words, the reliability of
modularized N-version programming is
Pr{at least two versions of each
module's outputs agree, for all
modules}.

We also note here that we can process
all the possible permutations (i.e.
N versions, M modules: N^M permutations)
and add to each module the information
as to which version it is. Then if there
are agreements in output, we can check
whether the agreements come from "independent"
permutations: permutations which do not
share any common version at any module
level.
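The bookkeeping described above can be illustrated with a short sketch. It is only an illustration under the stated reading of the text: an assembled program is represented as a tuple giving, for each of the M modules, the index of the version used, so that there are N^M such tuples, and two assemblies are treated as "independent" when they differ in the version used at every module position. The names are hypothetical.

```python
from itertools import product


def assemblies(n_versions, n_modules):
    """All N^M ways of picking one version for each module."""
    return list(product(range(n_versions), repeat=n_modules))


def independent(a, b):
    """True if two assemblies share no version at any module position."""
    return all(x != y for x, y in zip(a, b))


def independent_agreements(agreeing):
    """Pairs of agreeing assemblies that are mutually independent."""
    return [(a, b)
            for i, a in enumerate(agreeing)
            for b in agreeing[i + 1:]
            if independent(a, b)]
```

For N = 3 versions and M = 2 modules this enumerates 9 assemblies; (0, 1) and (1, 0) count as independent, while (0, 1) and (0, 2) do not because they share version 0 of the first module.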

We shall define the following terms.
R(i,j) = Reliability of the i-th version of
the j-th module.

R(i) = Reliability of the i-th version

RNVP = Reliability of N-version
programming without
modularization.

RNVP-M = Reliability of N-version
Programming with
modularization.


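Then we have,

$R_{NVP}$ = Pr{ At least two versions agree (correct) }

    $= 1 - \Big[ \prod_{i=1}^{N} (1 - R(i))
        + \sum_{i=1}^{N} \Big( R(i) \prod_{s \ne i} (1 - R(s)) \Big) \Big]$    (1)

$R_{NVP-M}$ = Pr{ At least two versions are correct at each module level }

    $= \prod_{j=1}^{M}$ Pr{ At least two versions are correct at j-th level }

    $= \prod_{j=1}^{M}$ [ 1 - Pr{no version correct} - Pr{exactly one version correct} ]

    $= \prod_{j=1}^{M} \Big[ 1 - \Big( \prod_{i=1}^{N} (1 - R(i,j))
        + \sum_{i=1}^{N} \Big( R(i,j) \prod_{s \ne i} (1 - R(s,j)) \Big) \Big) \Big]$    (2)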
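Formulas (1) and (2) translate directly into code. The following is a minimal sketch, assuming (as the text does) independence among versions and modules, with the reliabilities supplied as a nested list `R[i][j]` for version i and module j and with the reliability of a whole version taken as the product of its module reliabilities. The function names are hypothetical.

```python
from math import prod


def at_least_two_correct(rels):
    """Probability that at least two of the independent items are correct."""
    none = prod(1 - r for r in rels)
    exactly_one = sum(
        r * prod(1 - s for k, s in enumerate(rels) if k != i)
        for i, r in enumerate(rels)
    )
    return 1 - (none + exactly_one)


def nvp_reliability(R):
    """Formula (1): whole versions are compared; R(i) = prod_j R(i, j)."""
    version_rels = [prod(row) for row in R]
    return at_least_two_correct(version_rels)


def nvp_modular_reliability(R):
    """Formula (2): at least two versions correct at every module level."""
    n_modules = len(R[0])
    return prod(
        at_least_two_correct([row[j] for row in R])
        for j in range(n_modules)
    )
```

With three versions of two modules, all of reliability .94, the sketch gives about .9625 without modularization and about .9794 with it, in line with the tabulated values for that case.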


Part Two: Numerical results for the special
case when the reliability of each module is
the same (i.e. R(i,j) = R for all i
and j).

If R(i,j) = R for all i and j then,

    $R_{NVP} = 1 - (1 - R^M)^{N-1} \, (1 + (N-1) R^M)$

    $R_{NVP-M} = \big[ 1 - (1 - R)^{N-1} \, (1 + (N-1) R) \big]^M$

The three tables of values of RNVP,
RNVP-M and the ratio $(1 - R_{NVP}) / (1 - R_{NVP-M})$ are given
on the next pages (Tables 1, 2 and 3). The
last table is for the ratio of errors,
where the numerator is the probability
of failure in N-version programming
without modulation and the denominator
is the probability of error with
modulation. The results show a
substantial increase in reliability. For
example, when the reliability of each
module is constantly .98, using 8 modules
with 5 versions each reduces the
probability of failure by a factor of
about 327 (Table 3). We
must remember here that we assumed
independence among modules, which is not
likely to be true.

Section Three. Modularization effect on
Recovery Block Model.
Part One: Comments on general case.

In the Recovery Block Model, we have
additional factors to consider, namely

1) The reliability of acceptance
test(s),

2) Recovery reliability,

3) Arbitrary ordering of versions to be
executed.

In this paper, we shall make the simple
assumption that the acceptance test has
perfect reliability, as does recovery.
Then the ordering of versions does not
affect the reliability of the entire
scheme, because the probability that at
least one correct version exists
becomes the reliability of the Recovery
Block scheme.

We shall define the following terms.

RRB = Reliability of Recovery Block
Model without modularization.

RRB-M = Reliability of Recovery Block
Model with modularization.

Then we have,

$R_{RB}$ = Pr{ At least one version is correct }

    $= 1 - \prod_{i=1}^{N} (1 - R(i))$

Here R(i) = Reliability of one version

    $= \prod_{j=1}^{M} R(i,j)$

$R_{RB-M}$ = Pr{ At least one version is correct for each module }

    $= \prod_{j=1}^{M}$ Pr{ At least one version is correct at j-th module level }

    $= \prod_{j=1}^{M}$ ( 1 - Pr{ No correct version at j-th module level } )

    $= \prod_{j=1}^{M} \Big( 1 - \prod_{i=1}^{N} (1 - R(i,j)) \Big)$

Now if we assume that R(i,j) = R for all
i and j then,

    $R_{RB-M} = (1 - (1 - R)^N)^M$

and

    $R_{RB} = 1 - (1 - R^M)^N$

Again, the tables of RRB, RRB-M and the ratio
$(1 - R_{RB}) / (1 - R_{RB-M})$ are given (Tables 4, 5 and 6).
They show the substantial increase
in the reliability by modularization.
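The equal-reliability closed forms for both schemes are easy to evaluate directly, which gives a way to spot-check table entries. The sketch below assumes R(i,j) = r for all i and j and independence, as in the text; the function names are hypothetical.

```python
def nvp(r, n, m):
    """N-version programming without modularization."""
    rv = r ** m                       # reliability of one whole version
    return 1 - (1 - rv) ** (n - 1) * (1 + (n - 1) * rv)


def nvp_mod(r, n, m):
    """N-version programming with modularization."""
    return (1 - (1 - r) ** (n - 1) * (1 + (n - 1) * r)) ** m


def rb(r, n, m):
    """Recovery Block without modularization."""
    return 1 - (1 - r ** m) ** n


def rb_mod(r, n, m):
    """Recovery Block with modularization."""
    return (1 - (1 - r) ** n) ** m


def error_ratio(unmodularized, modularized):
    """Failure probability without modularization over that with it."""
    return (1 - unmodularized) / (1 - modularized)
```

For r = .94, N = 2, M = 2 this reproduces .780749 (N-version programming), .986451 (Recovery Block) and .9928129 (modularized Recovery Block). In double precision, entries whose failure probability is very small can differ from the printed values: for r = .98, N = 5, M = 8 the N-version error ratio evaluates to roughly 347 rather than the printed 327.16, presumably because the tables were computed with fewer significant digits.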
r = reliability of one module in one version
N = number of independent versions
M = number of modules per version

r = .94

N\M  2          3          4          5          6          7          8
2    .780749    .6898698   .609569    .5386151   .4759203   .4205232   .371574
3    .9625073   .9236198   .8768662   .8252617   .7711148   .7161698   .661722
4    .9942423   .9830212   .964774    .939675    .9084379   .8720594   .831647
5    .9991676   .9964392   .9904725   .9802682   .9652221   .9451244   .920105

r = .96

N\M  2          3          4          5          6          7          8
2    .8493465   .7827577   .7213896   .6648326   .6127097   .5646732   .520402
3    .9825241   .9632053   .9387492   .9103251   .8789226   .8453752   .810382
4    .9981858   .994404    .9878682   .978312    .9656716   .9500289   .931569
5    .999823    .9991988   .9977348   .9950484   .9907989   .9847116   .976587

r = .98

N\M  2          3          4          5          6          7          8
2    .9223682   .8858424   .8507631   .817073    .784717    .7536421   .723798
3    .9954198   .9900316   .9828557   .9740802   .9638796   .9524142   .939832
4    .9997589   .9992223   .9982375   .9967079   .9945587   .9917337   .988193
5    .9999881   .999943    .9998296   .9996067   .9992284   .9986473   .997815

Table 1: Reliability of non-modulated version when the
reliability of each module is r, N = number of versions and
M = number of modules

r = .94

N\M  2          3          4          5          6          7          8
2    .780749    .6898698   .609569    .5386151   .4759203   .4205232   .3715743
3    .9793715   .9692174   .9591685   .9492238   .9393822   .9296428   .9200042
4    .9983503   .9975265   .9967034   .9958809   .9950591   .9942381   .9934176
5    .9998766   .9998149   .9997532   .9996916   .9996299   .9995682   .9995066

r = .96

N\M  2          3          4          5          6          7          8
2    .8493465   .7827578   .7213896   .6648326   .6127097   .5646733   .5204029
3    .9906777   .9860492   .9814423   .9768569   .972293    .9677504   .9632291
4    .9995034   .9992552   .9990071   .9987591   .998511    .9982631   .9980152
5    .9999752   .9999628   .9999504   .999938    .9999256   .9999132   .9999008

r = .98

N\M  2          3          4          5          6          7          8
2    .9223682   .8858425   .8507631   .817073    .7847169   .7536421   .723798
3    .9976334   .9964522   .9952725   .9940941   .9929172   .9917416   .9905674
4    .9999369   .9999054   .9998739   .9998424   .9998108   .9997793   .9997478
5    .9999983   .9999975   .9999967   .9999958   .999995    .9999942   .9999933

Table 2: The reliabilities of modulated versions.
r = .94

N\M  2          3          4          5          6          7          8
2    1          1          1          1          1          1          1
3    1.817518   2.481277   3.015656   3.441344   3.775878   4.034128   4.22868
4    3.490191   6.864355   10.68551   14.64541   18.53157   22.20449   25.5763
5    6.74686    19.23961   38.60991   63.96986   93.95765   127.0928   161.904

r = .96

N\M  2          3          4          5          6          7          8
2    1          1          1          1          1          1          1
3    1.874644   2.637466   3.300566   3.874808   4.369925   4.794632   5.15672
4    3.653583   7.513806   12.21863   17.47666   23.05496   28.7696    34.4768
5    7.139423   21.54167   45.67789   79.87885   123.6923   176.1648   236.052

r = .98

N\M  2          3          4          5          6          7          8
2    1          1          1          1          1          1          1
3    1.935422   2.809798   3.626523   4.388822   5.099723   5.762057   6.37866
4    3.822306   8.221173   13.97448   20.88166   28.76182   37.45234   46.8071
5    7.142857   22.76191   51.03572   94.27143   154.1191   231.5714   327.160

Table 3: This table shows the ratio of probabilities of errors.
The denominator is the probability of error when modulation
is introduced and the numerator is when no modulation is
introduced.

r = .94

N\M  2          3          4          5          6          7          8
2    .986451    .9712982   .951929    .9291929   .9038192   .8764319   .847563
3    .9984229   .9951374   .9894604   .9811585   .9701714   .9565631   .940484
4    .9998164   .9991762   .9976891   .9949864   .9907492   .9847309   .976763
5    .9999787   .9998604   .9994934   .9986659   .9971311   .9946326   .990927

r = .96

N\M  2          3          4          5          6          7          8
2    .9938534   .9867142   .9773035   .9659127   .9528058   .9382216   .922376
3    .9995181   .9984686   .9965808   .9937066   .9897474   .9846448   .978373
4    .9999622   .9998234   .9994849   .9988381   .9977727   .9961834   .993974
5    .999997    .9999797   .9999224   .9997854   .9995161   .9990513   .998321

r = .98

N\M  2          3          4          5          6          7          8
2    .9984318   .9965416   .9939733   .9907688   .9869681   .9826091   .977728
3    .9999379   .9997966   .9995321   .9991131   .9985122   .9977066   .996676
4    .9999975   .999988    .9999636   .9999148   .9998301   .9996976   .999504
5    .9999999   .9999993   .9999971   .9999918   .9999806   .9999601   .999926

Table 4: The reliabilities of non-modulated
Recovery Block
r = .94

N\M       2          3          4          5          6          7          8
 2     .9928129   .9892388   .9856775   .9821291   .9785934   .9750704   .9715601
 3     .9995681   .9993521   .9991362   .9989204   .9987047   .9984889   .9982732
 4     .999974    .999961    .999948    .999935    .9999221   .9999091   .9998961
 5     .9999983   .9999975   .9999967   .9999958   .999995    .9999942   .9999933

r = .96

N\M       2          3          4          5          6          7          8
 2     .9968025   .9952076   .9936152   .9920254   .9904382   .9888534   .9872713
 3     .999872    .9998079   .9997439   .9996799   .9996159   .9995519   .9994879
 4     .9999949   .9999923   .9999898   .9999872   .9999846   .9999821   .9999795
 5     .9999998   .9999996   .9999995   .9999994   .9999993   .9999992   .9999991

r = .98

N\M       2          3          4          5          6          7          8
 2     .9992002   .9988004   .998401    .9980016   .9976024   .9972034   .9968046
 3     .999984    .9999761   .9999681   .9999601   .9999521   .9999441   .9999361
 4     .9999996   .9999994   .9999993   .9999991   .9999989   .9999988   .9999986
 5        1          1          1          1          1          1          1

Table 5: The reliabilities of modulated versions
of Recovery Block.


r = .94

N\M       2          3          4          5          6          7          8
 2     1.885187   2.667154   3.356325   3.962131   4.493034   4.956684   5.359952
 3     3.651166   7.50506    12.20247   17.45296   23.0278    28.74531   34.4671
 4     7.06422    21.13303   44.46101   77.16973   118.656    167.8716   223.5379
 5     12.78571   55.76191   151.7857   319.7572   573.0119   918.8776   1359.018

r = .96

N\M       2          3          4          5          6          7          8
 2     1.922304   2.772272   3.554809   4.274503   4.935713   5.542383   6.098332
 3     3.763967   7.97393    13.35335   19.6622    26.69305   34.27139   42.23467
 4     7.372093   22.96124   50.24419   90.66977   144.8372   212.7309   293.8663
 5     12.5       56.83333   162.75     360        676.5      1136.857   1760.313

r = .98

N\M       2          3          4          5          6          7          8
 2     1.960653   2.883081   3.769002   4.619382   5.435426   6.218457   6.969764
 3     3.88806    8.487562   14.64552   22.20896   31.04478   41.02026   52.01866
 4     7          22.33334   50.83333   95.33334   158.3333   241.6191   346.75
 5     large      large      large      large      large      large      large

Table 6: The ratio of errors: the numerators are the
probabilities of error for the non-modulated
version and the denominators are those for the modulated
version.
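
The relationship between Tables 4, 5 and 6 can be checked numerically. The short Python sketch below assumes that "probability of error" means one minus the tabulated reliability (an interpretation, not something the tables state explicitly) and recomputes the Table 6 entry for r = .96, N = 2, M = 2 from the corresponding Table 4 and Table 5 entries. The function name `error_ratio` is purely illustrative.

```python
def error_ratio(rel_non_modulated, rel_modulated):
    """Ratio of error probabilities, taking error = 1 - reliability (assumed)."""
    return (1.0 - rel_non_modulated) / (1.0 - rel_modulated)


# Table 4 (non-modulated) and Table 5 (modulated) entries for r = .96, N = 2, M = 2.
ratio = error_ratio(0.9938534, 0.9968025)
print(round(ratio, 4))  # compare with the Table 6 entry 1.922304
```

Because the tabulated reliabilities are rounded to seven digits, the recomputed ratio can only be expected to agree with Table 6 to about three or four significant figures.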
LINEAR PREDICTION OF FAILURE TIMES OF A REPAIRABLE SYSTEM


M.  Ahsanullah,  Rider  College 


1.  ABSTRACT 

Suppose  we  consider  a  repairable  system  in 
which  a  failed  component  is  replaced  immediately 
by  a  component  of  equal  age.  On  replacement  of 
the  component,  the  system  becomes  operational 
and  the  repairing  time  of  the  component  is 
assumed  to  be  negligible.  We  assume  the  survival 
times  of  the  components  are  independent  and 
identically  distributed.  Some  distributional 
properties  of  the  n-th  survival  time  are 
discussed  when  the  survival  times  have  different 
life  distributions.  Various  predictions  of  the 
s-th  failure  time  Xs  (s  >  n)  based  on  the 
first  n  failure  times  are  obtained. 


(NWU), if $\bar F(x+y) \ge \bar F(x)\,\bar F(y)$, for all $x, y \ge 0$.

We will say F belongs to the class $C_1$ if F is
either NBU or NWU. We will say F belongs to $C_2$ if

$r(x) = f(x)(\bar F(x))^{-1}$, $\bar F(x) > 0$, is either monotone
increasing or decreasing.

For various life distributions, the distributional
properties of the n-th survival time
are discussed. Linear prediction of the s-th
failure time based on the first n (n < s)
failure times is given.


2.  INTRODUCTION 

We  consider  a  repairable  system  in  which  a 
failed  component  is  replaced  immediately  by  a 
component  of  equal  age  and  the  system  becomes 
operational. Let us denote by $X_0, X_1, X_2, \ldots$
the failure times of the system, where $X_0 = 0$.
The times between failures, $U_n = X_n - X_{n-1}$, $n \ge 1$,
are non-negative random variables. Let
$F(t) = P(U_1 \le t)$, for $t \ge 0$, and $\bar F(t) = 1 - F(t)$.
We assume that F(t) has a density f(t) with
$F(0) = 0$ and $r(t) = f(t)(\bar F(t))^{-1}$ for $\bar F(t) > 0$.
The function r(t) is called the hazard rate and
$R(t) = \int_0^t r(u)\,du$ is called the cumulative hazard
rate. Let $F_{(n)}(t) = \Pr(X_n \le t)$ and $f_{(n)}(t) = F'_{(n)}(t)$.
Then

$1 - F_{(n)}(x) = \bar F(x)$, if n = 1


3.  MAIN  RESULTS 


Let $r_{(n)}(t)$ denote the hazard function of $X_n$,
then

$r_{(2)}(t) = f_{(2)}(t) / \bar F_{(2)}(t) < r(t)$

for all t, t > 0.

In general

$r_{(n)}(t) < r_{(n-1)}(t)$, $n \ge 2$, for all t, t > 0,
where $r_{(1)}(t) = r(t)$.

Let $U_n = X_n - X_{n-1}$, n = 1, 2, 3, ..., be the time
between the n-th and (n-1)-th failures. Suppose
$G_n(t)$ and $g_n(t)$ are respectively the probability
distribution function and probability density
function of $U_n$. Then we can write


$= \bar F(x) + \bar F(x) R(x)$, if n = 2,
and in general

$1 - F_{(n)}(x) = \bar F(x) \sum_{i=0}^{n-1} (R(x))^i (i!)^{-1}$.

$1 - F_{(n)}(x)$ can be interpreted as the survival
function of the time to the n-th failure of the system given
that the failed components of the system were
replaced by ones of equal age and the repair
times were negligible. The probability density
function (pdf) $f_{(n)}(x)$ of $X_n$ can be written as

$f_{(n)}(x) = f(x)\,\frac{(R(x))^{n-1}}{\Gamma(n)}$, $n \ge 1$, $x > 0$

= 0, otherwise.

If F is the distribution function of a non-negative
random variable, we will call F 'new
better than used' (NBU), if $\bar F(x+y) \le \bar F(x)\,\bar F(y)$
for all $x, y \ge 0$, and F is 'new worse than used'


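Lemma 3.1

If F belongs to $C_1$, then $E(U_n) \le (\ge)\ E(U_1)$
according as F is NBU (NWU).

$E(U_n) = \int_0^\infty \int_0^\infty (\Gamma(n))^{-1} (R(u))^{n-1} f(u)\, \frac{\bar F(z+u)}{\bar F(u)}\, du\, dz$

$\le (\ge) \int_0^\infty \int_0^\infty (\Gamma(n))^{-1} (R(u))^{n-1} f(u)\, \bar F(z)\, du\, dz$,

according as $\bar F(z+u) \le (\ge)\ \bar F(u)\,\bar F(z)$. Hence
$E(U_n) \le (\ge)\ E(U_1)$.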
(a) Uniform Distribution:

Suppose the random variable $U_1$ has a two
parameter rectangular distribution with the
following probability density function


$f(x) = \frac{1}{\beta - \alpha}$, $-\infty < \alpha < x < \beta < \infty$

= 0, otherwise.

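It can easily be shown that


$f_n(x) = \frac{1}{\Gamma(n)}\,\frac{1}{\beta - \alpha} \left[ \ln \frac{\beta - \alpha}{\beta - x} \right]^{n-1}$, $\alpha < x < \beta$


The best linear unbiased predictor $\hat X_s$ of $X_s$ is

$\hat X_s = 2^{n-s} x_n - (2^{n-s} - 1)\,\hat\beta$.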

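(b) Pareto Distribution:

Suppose the random variable $U_1$ has the Pareto
distribution with the following cumulative distribution
function F(x)


$E(X_n) = 2^{-n} \alpha + (1 - 2^{-n}) \beta$


$V(X_n) = (3^{-n} - 4^{-n})(\beta - \alpha)^2$


The joint pdf of $X_m$, $X_n$ (m < n) is


$F(x) = 1 - (\theta/x)^{\nu}$, $0 < \theta < x$, $\nu > 0$.


The pdf $f_n(x)$ of $X_n$ can be written as


$f_n(x) = \frac{1}{\Gamma(n)}\,\frac{\nu}{\theta} \left( \frac{\theta}{x} \right)^{\nu+1} \left( \nu \ln(x/\theta) \right)^{n-1}$,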


f  /x  .  1  1  .  1  1 

^m,n  '^m*  *r\'  (m-2)  1  (n-m-1)  1  B-a  ‘  B-x^ 

n-m-1 


m-2  6-x„ 

m  n 


<  X  X-  <  b 

m  n 


=  0,  otherwise. 


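$0 < \theta < x < \infty$


= 0, otherwise.

$E(X_n) = \theta \left( \frac{\nu}{\nu - 1} \right)^{n}$, for $\nu > 1$

$E(X_n^k) = \theta^k \left( \frac{\nu}{\nu - k} \right)^{n}$, $\nu > k$


$E(X_n \mid X_m = x_m) = 2^{m-n} x_m + (1 - 2^{m-n})\,\beta$


$X_n \stackrel{d}{=} \theta \prod_{i=1}^{n} U_i$, $n \ge 1$


$\mathrm{Cov}(X_m, X_n) = 2^{m-n}\,\mathrm{Var}(X_m)$, n > m.


The minimum variance unbiased estimates $\hat\alpha$, $\hat\beta$ of $\alpha$
and $\beta$ based on the observed values $x_1, x_2, \ldots, x_n$
of $X_1, X_2, \ldots, X_n$ are


$\hat\alpha = 2 x_1 - \hat\beta$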


*  '  3(3"~^-l) 


4  /^n-l  1  ,n-2 


•  Vl  •••  -  2  Xj) 


1  3"-l  ,2 

V(a)  =  -as  -r— T (B-a) 

^  3""  -1 


2  3'-l 


V(B)  =  g 


9  -n-l 


3""‘-  1 


(6-a) 


Cov  (a,  b)  =  -  4  — -^-1 (b-o) 

y  3n-i_j 


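where $\stackrel{d}{=}$ denotes equality in distribution and
$U_1, \ldots, U_n$ are independent and identically
distributed as Pareto with the following
pdf f(x), where

$f(x) = \nu x^{-(\nu+1)}$, $x \ge 1$

= 0, otherwise.


The product moments of $X_m$ and $X_n$ (m < n) can be
obtained as follows

$X_m^r X_n^s \stackrel{d}{=} \theta^{r+s} \left( \prod_{i=1}^{m} U_i \right)^{r+s} \left( \prod_{i=m+1}^{n} U_i \right)^{s}$

and thus

$E(X_m^r X_n^s) = \theta^{r+s} \left( \frac{\nu}{\nu - r - s} \right)^{m} \left( \frac{\nu}{\nu - s} \right)^{n-m}$

$E(X_m X_n) = \theta^{2} \left( \frac{\nu}{\nu - 2} \right)^{m} \left( \frac{\nu}{\nu - 1} \right)^{n-m}$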
Table 1: Variances and covariances of $X_m$ and $X_n$

                              ν
 m  n      2.5          3         3.5       4.0       4.5       5.0
 1  1      2.2222      .7500      .3733     .2222     .1469     .1042
 1  2      3.7037     1.1250      .5227     .2963     .1889     .1303
 2  2     17.2840     3.9375     1.6028     .8395     .5074     .3364
 1  3      6.1728     1.6875      .7317     .3951     .2429     .1628
 2  3     28.8067     5.9063     2.2440    1.1193     .6524     .4205
 3  3    103.5665    15.6094     5.1742    2.3813    1.3148     .8149
 1  4     10.2881     2.5313     1.0244     .5267     .3123     .2035
 2  4     48.0110     8.8594     3.1416    1.4925     .8387     .5256
 3  4    172.6109    23.4141     7.2438    3.1751    1.6905    1.0187
 4  4    565.5626    55.3711    14.8841    6.0113    3.0304    1.7556

Table 1 gives the variances and covariances of
$X_m$, $X_n$ for $\nu$ = 2.5, 3, 3.5, 4, 4.5, 5.0 and
$1 \le m, n \le 4$ with $\theta = 1$.
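
The entries of Table 1 can be reproduced from the product-moment expression for the Pareto case. The Python sketch below assumes that expression has the form E(X_m^r X_n^s) = θ^(r+s) (ν/(ν-r-s))^m (ν/(ν-s))^(n-m) for m ≤ n and ν > r + s; this form is a reading of the partly illegible source, and it does reproduce the tabulated values. The function names are illustrative and not part of the paper.

```python
def pareto_product_moment(m, n, r, s, nu, theta=1.0):
    """E(X_m^r X_n^s) for m <= n under the assumed product-moment formula."""
    if m > n:
        raise ValueError("requires m <= n")
    if nu <= r + s:
        raise ValueError("the moment exists only for nu > r + s")
    return (theta ** (r + s)
            * (nu / (nu - r - s)) ** m
            * (nu / (nu - s)) ** (n - m))


def pareto_cov(m, n, nu, theta=1.0):
    """Cov(X_m, X_n) for m <= n; with m == n this is Var(X_n)."""
    mean_m = pareto_product_moment(m, m, 1, 0, nu, theta)   # E(X_m)
    mean_n = pareto_product_moment(n, n, 1, 0, nu, theta)   # E(X_n)
    cross = pareto_product_moment(m, n, 1, 1, nu, theta)    # E(X_m X_n)
    return cross - mean_m * mean_n


# A few entries to compare with Table 1 (theta = 1).
for m, n, nu in [(1, 1, 2.5), (1, 2, 3.0), (2, 2, 3.0), (4, 4, 5.0)]:
    print(m, n, nu, round(pareto_cov(m, n, nu), 4))
```

Second moments require ν > 2, which is consistent with the table starting at ν = 2.5.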

(c) Exponential Distribution:

Suppose the random variable $U_1$ has a two
parameter exponential distribution with the
following pdf f(x)

$f(x) = \sigma^{-1} \exp(-(x - \mu)/\sigma)$,

for $x > \mu$, $\sigma > 0$,


$\mathrm{Var}(\hat\mu) = m \sigma^2/(m-1)$

$\mathrm{Var}(\hat\sigma) = \sigma^2/(m-1)$

$\mathrm{Cov}(\hat\mu, \hat\sigma) = -\sigma^2/(m-1)$.

The best linear unbiased predictor $\hat X_s$ of
$X_s$ based on the observed failure times $x_1, x_2,
\ldots, x_m$ is

$\hat X_s = ((s-1)\,x_m - (s-m)\,x_1)/(m-1)$
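
As an illustration of the predictor for the exponential case, the following Python sketch evaluates ((s-1) x_m - (s-m) x_1)/(m-1) from a list of observed failure times. The function name and the example data are hypothetical; only the formula comes from the text.

```python
def predict_failure_time(observed, s):
    """Predict the s-th failure time from the first m observed failure times.

    Implements ((s - 1) * x_m - (s - m) * x_1) / (m - 1), which needs
    at least two observations (m >= 2) and a future index s > m.
    """
    m = len(observed)
    if m < 2:
        raise ValueError("at least two observed failure times are required")
    if s <= m:
        raise ValueError("s must exceed the number of observed failures")
    x_1, x_m = observed[0], observed[-1]
    return ((s - 1) * x_m - (s - m) * x_1) / (m - 1)


# Hypothetical failure times; only the first and last observation enter the predictor.
print(predict_failure_time([3.0, 5.0, 7.0], s=5))
```

The predictor is a straight-line extrapolation through the first and the m-th observed failure times, which is why the intermediate observations do not appear in it.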


=  0  ,  otherwise. 


MX, 


F(X 


L(s)' 


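The pdf $f_n(x)$ of $X_n$ can be written as follows:


$f_n(x) = \frac{1}{\Gamma(n)\,\sigma} \left( \frac{x - \mu}{\sigma} \right)^{n-1} \exp(-(x - \mu)/\sigma)$, $x > \mu$


$E(X_n) = \mu + n \sigma$


$V(\hat X_s) = \sigma^2 (m + s^2 - 2s)/(m-1)$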
THE SIMULATION OF LIFE TESTS WITH RANDOM CENSORING
Joseph C. Hudson, GMI Engineering & Management Institute


Abstract

This paper considers the simulation of
life tests in which n items are placed on
test and remain until removed by either
failure or random censoring. The
censoring mechanism is taken to be
independent of the failure mechanism.
Simulation is done under the constraint
that the number of items censored is a
binomial random variable, allowing
simulations to be run specifying the
expected percentage of censored items.

Details of the implementation are
discussed and a validation study is
presented. The simulation is implemented
in Pascal.

Introduction 

The development of techniques for
reliability data analysis requires data
from known distributions for empirical
validation and comparison studies. Such a
need motivated the work reported in this
paper. Randomly censored failure data were
needed from a spectrum of short and long
tailed distributions. The algorithm
presented here simulates data from tests
in which n items are placed on test. Each
item remains on test until either failure
or removal from test by a random
censoring mechanism independent of the
failure mechanism.

Simulations are carried out using
Weibull, uniform, truncated normal and
truncated Cauchy failure distributions.
The censoring distribution is taken to be
exponential. With user-specified failure
distribution and probability of censoring
$p_c$, the mean of the censoring
distribution is determined to enforce the
constraint that $P(T_{ci} < T_{fi}) = p_c$,
where $T_{ci}$ and $T_{fi}$ are the censoring and
failure times of the i-th item,
respectively. In performing the
simulation, a failure time and a
censoring time are independently
generated for each item, with the smaller
of these times taken as the time of
removal from test. Time of and reason for
removal from test are reported for each
item.

Use of the simulation procedure
involves the following steps:

1. Choose a failure distribution from
the truncated Cauchy, truncated normal,
uniform or Weibull families.

2. Choose a probability of censoring
and find the mean of the censoring
distribution.

3. Choose the sample size n. Generate
the random sample using the following
procedure for each item:

a. Randomly generate a failure time.

b. Randomly generate a censoring
time.

c. The smaller of the two times
determines the stopping event, failure or
censoring. Record the type of event and
the time of occurrence.

Each of these steps will be discussed
below.
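
Step 3 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the paper's Pascal implementation: it uses Python's standard `random` module instead of the shuffled generator described later, the function name `simulate_life_test` is hypothetical, and the example takes the uniform failure distribution on [0, 10000] with censoring mean 1000, the combination the paper reports for a censoring probability of .9.

```python
import random


def simulate_life_test(n, draw_failure, censor_mean, rng=None):
    """Return a list of (time, event) pairs with event "failure" or "censored"."""
    rng = rng or random.Random()
    results = []
    for _ in range(n):
        t_fail = draw_failure(rng)                    # step 3a: failure time
        t_cens = rng.expovariate(1.0 / censor_mean)   # step 3b: exponential censoring time
        # Step 3c: the smaller time decides the stopping event.
        if t_cens < t_fail:
            results.append((t_cens, "censored"))
        else:
            results.append((t_fail, "failure"))
    return results


rng = random.Random(12345)
sample = simulate_life_test(100, lambda r: r.uniform(0.0, 10000.0), 1000.0, rng)
n_censored = sum(1 for _, event in sample if event == "censored")
print(n_censored, "of", len(sample), "items censored")
```

Passing the failure distribution as a function keeps the loop unchanged when another failure family is used.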

Choice  of  Failure  Distribution 

The distributions available were
chosen to offer both long and short
tailed alternatives to the Weibull. The
Cauchy and normal distributions are
truncated at 0 to avoid negative failure
times. Since step 2 is implemented with
general procedures, the list of available
failure distributions can be expanded.

Finding  the  Mean  of  the 

Censoring  Distribution 

The censoring time for the i-th item on
test, $T_{ci}$, follows the exponential
distribution with density

$f_c(t) = \frac{1}{\mu} e^{-t/\mu}$, t > 0.

For brevity, the i subscript is
suppressed. For given $\mu$ and failure CDF
$F_f(t)$, the probability that the i-th event
is a censoring is

$P(\mu) = P(T_c < T_f) = \int_0^\infty P(T_f > t)\, f_c(t)\, dt$

$= \frac{1}{\mu} \int_0^\infty (1 - F_f(t))\, e^{-t/\mu}\, dt$.   (1)

A representative graph of $P(\mu)$ is shown
in figure 1. $P(\mu)$ has a number of useful
properties:

$\lim_{\mu \to 0^+} P(\mu) = 1$

$\lim_{\mu \to \infty} P(\mu) = 0$   (2)

$P(\mu)$ is monotonically decreasing in $\mu$.

Proof is straightforward. These
properties guarantee a unique solution
to the equation

$P(\mu) = p_c$.   (3)

The solution $\mu_c$ is found using the secant method
(Hornbeck [1975]) modified as shown in
figure 2. The modification involves the
behavior of the points $(\mu_1, P_1)$ and
$(\mu_2, P_2)$ used to define the secant line.
The relationship $P_1 > p_c$ is maintained to
keep the point $(\mu_1, P_1)$ to the left of the
goal $(\mu_c, p_c)$. This ensures that the
sequence of $P_1$ values monotonically
approaches $p_c$, at the expense of
computation time. The procedure
terminates when the relative difference
between $P_1$ and $p_c$ falls below a specified
error tolerance.

Each iteration of the secant method
requires evaluation of (1) to find $P(\mu)$. We
briefly discuss the procedure for doing
this for each of the failure
distributions.

If $T_f$ has a uniform distribution on
[a, b], $a \ge 0$,


$P(\mu) = 1 + \frac{\mu}{b - a} \left[ e^{-b/\mu} - e^{-a/\mu} \right]$.   (4)


Solution of (3) for $(\mu_c, p_c)$ proceeds
without difficulty.
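
For the uniform case the search for the censoring mean can be sketched as follows. The sketch assumes the closed form P(μ) = 1 + μ/(b-a) [exp(-b/μ) - exp(-a/μ)], which follows from (1) for a uniform failure time on [a, b]. Plain bisection is used here as a stand-in that relies only on the monotonicity of P(μ); it is not the modified secant method of figure 2. The function names are illustrative.

```python
import math


def censor_prob_uniform(mu, a, b):
    """P(mu) = P(Tc < Tf) for Tf uniform on [a, b] and Tc exponential with mean mu."""
    return 1.0 + mu / (b - a) * (math.exp(-b / mu) - math.exp(-a / mu))


def solve_censor_mean(p_c, a, b, rel_tol=1e-10, max_iter=200):
    """Find mu with P(mu) = p_c by bisection (P is decreasing in mu)."""
    if not 0.0 < p_c < 1.0:
        raise ValueError("p_c must lie strictly between 0 and 1")
    lo, hi = 1e-6 * b, b
    while censor_prob_uniform(lo, a, b) < p_c:   # move lo left until P(lo) >= p_c
        lo *= 0.5
    while censor_prob_uniform(hi, a, b) > p_c:   # move hi right until P(hi) <= p_c
        hi *= 2.0
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        if censor_prob_uniform(mid, a, b) > p_c:
            lo = mid
        else:
            hi = mid
        if hi - lo <= rel_tol * hi:
            break
    return 0.5 * (lo + hi)


mu_c = solve_censor_mean(0.9, 0.0, 10000.0)
print(mu_c, censor_prob_uniform(mu_c, 0.0, 10000.0))
```

For the uniform distribution on [0, 10000] and a censoring probability of .9, the result is very close to 1000, the censoring mean quoted with the figure for that case.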

If $T_f$ has a truncated normal
distribution with (pretruncation) mean M
and standard deviation $\sigma$,


where Atan is the inverse tangent
function and $P_n$ is the probability that
the pretruncation failure random variable
is negative. This case is processed as
the Weibull, with the integral in (9)
approximated by

Je"^ Atan  dt  + .  5e"’‘  ^^Atan  j  ^  1 03 

o 

where  x  >  satisfies  the  inequality 

E  S  -  Atan[4^]]  C113 

with E the desired error tolerance in the
integral approximation. x is found using
the secant method.


$P(\mu) = 1 - \int_0^\infty \frac{\exp\left( -t/\mu - (t - M)^2/(2\sigma^2) \right)}{\sigma (1 - P_n) \sqrt{2\pi}}\, dt$   (5)

where $P_n$ is the probability that a normal
random variable with mean M and standard
deviation $\sigma$ is negative. (5) may be
reduced to

$P(\mu) = 1 - P(Z > A)\, e^{\sigma^2/(2\mu^2) - M/\mu} / (1 - P_n)$   (6)

where $A = \sigma/\mu - M/\sigma$ and Z is the standard
normal variate. (6) may be readily
evaluated using an adaptation of Maclaurin
series and continued fraction expansions
of the error function (Nonweiler [1984]).
Hudson  ClOeO]  gives  details. 

If $T_f$ has a three parameter Weibull
distribution with minimum life $\delta$,
characteristic life $\theta$ and shape parameter
$\beta$, then

$P(\mu) = 1 - e^{-\delta/\mu} \left[ 1 - \int_0^\infty e^{-t} e^{-(\mu t/\theta)^{\beta}}\, dt \right]$.   (7)

The integral in (7) may be replaced with
an integral with finite limits using the
relationship

$\int_0^\infty e^{-t} e^{-(\mu t/\theta)^{\beta}}\, dt \approx \int_0^{-\ln(2E)} e^{-t} e^{-(\mu t/\theta)^{\beta}}\, dt + E$.   (8)

The error of approximation using (8) is
less than E, so (3) may be solved for $\mu_c$
to the desired error tolerance by
assigning a portion of the error
tolerance to E. The numerical evaluation
of the integral in (8) is carried out
using adaptive Simpson's quadrature with
Richardson's improvement (Marlon [1987]).
The variant used is shown in figure 3.
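
The Weibull computation can be illustrated with a small Python sketch. It assumes that P(μ) has the form 1 - exp(-δ/μ) [1 - I], where I is the integral of exp(-t - (μt/θ)^β) over t > 0, and that I is approximated by truncating at -ln(2E) and adding E; both are readings of the partly illegible equations (7) and (8). The quadrature routine is a textbook recursive adaptive Simpson rule with Richardson's correction, not necessarily the stack-based variant of figure 3, and all names are illustrative.

```python
import math


def adaptive_simpson(f, lo, hi, tol=1e-9, max_depth=40):
    """Integrate f over [lo, hi] by adaptive Simpson with Richardson's correction."""
    def simpson(a, fa, b, fb):
        mid = 0.5 * (a + b)
        fmid = f(mid)
        return mid, fmid, (b - a) / 6.0 * (fa + 4.0 * fmid + fb)

    def refine(a, fa, b, fb, mid, fmid, whole, tol, depth):
        lmid, flmid, left = simpson(a, fa, mid, fmid)
        rmid, frmid, right = simpson(mid, fmid, b, fb)
        diff = left + right - whole
        if depth <= 0 or abs(diff) <= 15.0 * tol:
            return left + right + diff / 15.0   # Richardson's improvement
        return (refine(a, fa, mid, fmid, lmid, flmid, left, 0.5 * tol, depth - 1)
                + refine(mid, fmid, b, fb, rmid, frmid, right, 0.5 * tol, depth - 1))

    fa, fb = f(lo), f(hi)
    mid, fmid, whole = simpson(lo, fa, hi, fb)
    return refine(lo, fa, hi, fb, mid, fmid, whole, tol, max_depth)


def censor_prob_weibull(mu, delta, theta, beta, eps=1e-8):
    """P(mu) for a three parameter Weibull failure time (assumed form of (7), (8))."""
    upper = -math.log(2.0 * eps)                 # finite upper limit from (8)
    integral = adaptive_simpson(
        lambda t: math.exp(-t - (mu * t / theta) ** beta), 0.0, upper) + eps
    return 1.0 - math.exp(-delta / mu) * (1.0 - integral)


# Parameters quoted with the best-fit figure: Mu 83604, ML 0, CL 10000, S 1.5.
print(censor_prob_weibull(83604.0, 0.0, 10000.0, 1.5))
```

With these parameters the computed probability is close to .1, the censoring probability reported for that case, which supports the assumed form of the equations.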

If $T_f$ has a truncated Cauchy
distribution with pretruncation median a
and scale parameter b, then


$P(\mu) = 1 - \frac{\mathrm{Atan}(a/b)}{\pi (1 - P_n)} - \frac{1}{\pi (1 - P_n)} \int_0^\infty e^{-t}\, \mathrm{Atan}\left( \frac{\mu t - a}{b} \right) dt$.   (9)


Generating  the  Random  Samples 

The sampling procedure requires n
random observations each from the
censoring and failure distributions. To
generate these, the output of a uniform
(0,1) random number generator is shuffled
using algorithm B of Knuth [1981, pg 32]
with an auxiliary table of length 117.
The resulting uniform (0,1) random number
is converted as needed using standard
transformations for the uniform, Weibull
and exponential cases (Hastings and
Peacock, [1975]). Normal deviates are
generated using a ratio method, algorithm
R of Knuth [1981, pg 125]. Cauchy
variates are generated using the ratio of
two independent standard normal deviates
(Hastings and Peacock, [1975, pg 42]).
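
The shuffling step and the inverse transformations can be sketched in Python as below. The class is a minimal rendering of the table-shuffling idea (the previous output selects the table slot to emit and refill) with a table of length 117, adapted to floating-point uniforms; the base generator is Python's `random.Random`, which is an assumption and not the generator used in the paper. The class and function names are illustrative.

```python
import math
import random


class ShuffledUniform:
    """Shuffle a base uniform [0, 1) stream through an auxiliary table."""

    def __init__(self, base, table_size=117):
        self._base = base
        self._table = [base() for _ in range(table_size)]
        self._last = base()

    def next(self):
        j = int(len(self._table) * self._last)   # slot chosen by the previous output
        self._last = self._table[j]
        self._table[j] = self._base()            # refill the slot just used
        return self._last


def exponential_variate(u, mean):
    """Inverse transform of a uniform [0, 1) number u to an exponential deviate."""
    return -mean * math.log(1.0 - u)


def weibull_variate(u, min_life, char_life, shape):
    """Inverse transform of u to a three parameter Weibull deviate."""
    return min_life + char_life * (-math.log(1.0 - u)) ** (1.0 / shape)


gen = ShuffledUniform(random.Random(7).random)
u = gen.next()
print(exponential_variate(u, 1000.0), weibull_variate(u, 0.0, 10000.0, 1.5))
```

Using 1 - u inside the logarithm keeps the argument strictly positive, because the base generator returns values in [0, 1).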

Validation Study

The algorithm is implemented in
Pascal. To verify the implementation, 100
samples of 100 items each were generated
for each of the 24 failure distributions
shown in table 1. The estimated value of
$p_c$ generated by the samples and the value
of $\mu_c$ found by the algorithm and used in
the sampling procedure are also shown.

If the sampling procedure performs as
designed, each of the 24 sets of 100
samples is a sample of size 100 from a
binomial distribution with n = 100 and p
the probability of censoring, either .1
or .9. Chi square goodness of fit tests
were carried out to test this hypothesis
against its negation. 11 cells were used
for each test, giving 10 degrees of
freedom. The resulting $\chi^2$ values and
their P values are shown in table 1.
Figures 4 and 5 show the best and worst
cases from among the 24.

The 24 observations of $\chi^2$ should
constitute a random sample from the chi
square distribution with 10 degrees of
freedom if the algorithm is properly
implemented. An additional study of the
ordered $\chi^2$ values did not reveal any
grouping or unusual patterns among these
24 values.
Table 1. Summary of the validation study.

Dist      Ref                                  Sample                 χ²       P
          No.   Median    Shape     p_c    Est of p_c    μ_c        10 df    Value
Cauchy     1    10.000    2         .1     .1003         04.028      8.91    .970
           2    10.000    2         .9     .8007         4.342      10.87    .077
           3    10.000    10,000    .1     .1000         220.378     9.94    .892
           4    10.000    10.000    .9     .8990         3.482       8.31    .990

                Mean      St Dev
Normal     5    10.000    2         .1     .1014         04.012     12.70    .237
           6    10.000    2         .9     .8989         4,343       3.13    .078
           7    10.000    10.000    .1     .1040         119.740     0.99    .720
           8    10.000    10.000    .9     .9022         2.810       0.00    .797

                MinLife   MaxLife
Uniform    9    0         10,000    .1     .1009         40.008      9.19    .914
          10    0         10.000    .9     .9080         1,000      17.80    .098
          11    0.000     10,000    .1     .loie         04.438      7.70    .002
          12    0.900     10,000    .9     .9011         4.321       0.94    .708

                MinLife   Slope     (CharLife = 10,000)
Weibull   13    0         0.9       .1     .0090         140.074     8.90    .980
          14    0         0.9       .9     .0019         140         7.30    .601
          15    0         1.9       .1     .0081         83,004      2.10    .009
          16    0         1.9       .9     .8940         2.009       8.45    .989
          17    0         4.0       .1     .0082         89,070      4.72    .900
          18    0         4.0       .9     .8089         3,930      10.07    .008
          19    9.000     0.9       .1     .1009         290.214     8.80    .543
          20    0.900     0.9       .9     .9020         0.398      15.21    .129
          21    0.000     1.5       .1     .1003         178,090     3.19    .077
          22    9.000     1.9       .9     .9004         7.370       0.83    .741
          23    0.000     4.0       .1     .1009         170,821     0.41    .780
          24    0,000     4.0       .9     .8079         8,061       8.91    .970

Figure 1. $P(\mu)$ for the uniform [490.900] failure distribution.


Figure 2. Modified secant method used to find
[Flowchart boxes: set counts and sums to 0; set CurrInt to [LL, UL];
apply Simpson's 3 point quadrature to CurrInt to get CurrEst;
create CurrLeft and CurrRight and apply Simpson's 3 point rule to
estimate integrals over each; use Richardson's improvement to get
CurrVal, compute CurrErr and CurrAccErr; test |CurrErr| < CurrAccErr;
add integral and error estimates to accumulators; test whether the
stack is empty; test whether CurrLeft can be divided; record error
information; push CurrRight onto stack; make CurrInt = CurrLeft;
pop new CurrInt from stack; report results.

Annotations: CurrInt is the subinterval currently being processed.
CurrLeft and CurrRight are the left and right halves of CurrInt.
CurrVal estimates the integral over CurrInt to at least O(h*).
CurrErr estimates the error in this approximation. CurrAccErr is
proportional to the width of CurrInt.]


Figure 3. Adaptive Simpson's Quadrature applied to $\int f(t)\,dt$.
Figure 4. The best fit from the validation study.
[Weibull Failure Distribution: Mu 83604, Pc .1, ML 0, CL 10000, S 1.5. Axis label: Number of Censored Items.]


Figure 5. The worst fit from the validation study.
[Uniform Failure Distribution: Mu 1000, Pc .9, Min 0, Max 10000. Axis label: Number of Censored Items.]


An Identifiable Model for Informative Censoring
William A. Link, U.S. Fish and Wildlife Service, Patuxent Wildlife Research Center


1. INTRODUCTION


The usual model for censored survival analysis of a
“lifetime” T is that observations are of the form $(X, \delta)$ where
$X \le T$ and $\delta$ takes the values 0 and 1, depending on whether
X < T or X = T, respectively. A great deal of attention has
been given to the “independent censoring model” in which it is
assumed that X = min(T, C), with C (referred to as the
“censoring variable”) assumed to be statistically independent of
the lifetime under consideration. This model is appealing
because of its qualitative simplicity and mathematical
tractability.

Under the independent censoring model, the Kaplan-Meier
estimator (KME) (Kaplan and Meier, 1958) is the appropriate
estimator of the survival function S(t) = P(T > t). Földes
and Rejtő (1981) established strong uniform consistency of the
KME and Gill (1983) provided weak convergence results on the
entire positive half-line.

Analyses based on the usual model may yield unreliable
results if the independent censoring assumption is
inappropriate. As a hypothetical example, consider a radio
telemetry study of the life expectancy of a mallard. Censored
observations would occur upon failure of the radio transmitter.
If failure of the transmitter were related solely to the reliability
of the unit, the event of censoring could safely be assumed to
carry no information about the status of the bird. However if
failure of the unit were due to predation, censoring would be
equivalent to death of the bird. Between these two extremes is
a wide range of possible models in which the event of censoring
carries an unfavorable prognosis for future survival. In such a
case the KME will tend to overestimate the true survival
probabilities.

Unfortunately, if the only observations available are the
pairs $(X, \delta)$, the independence assumption is completely
untestable. It has been shown by Cox (1959) and Tsiatis
(1975) that “there always exist independent censoring models
consistent with any probability distribution for the observable
pair $(X, \delta)$” (Lagakos, 1979). Consequently, if the
independence assumption is deemed inappropriate, analysis
must rely on equally untestable model assumptions and
(perhaps) the observation of covariates.

The reader is referred to the work of Williams and Lagakos
(1977) and Lagakos and Williams (1978) as well as that of
Robertson and Uppuluri (1984) for examples of such methods.
In the former, it is assumed that the hazard function of the X's
is related to that of the T's by


where 


They suggest approximating r(t) by a step function taking k
distinct values $c_1, \ldots, c_k$. Their procedure for estimating S(·)
under the cone class model assumption involves an arbitrary
choice of the number and length of these intervals and, in
addition, specifying that S(·) is a member of some parametric
family of distributions. Estimation of the parameters
corresponding to S(·), of $c_1, \ldots, c_k$, and of $\theta$, is then
accomplished by maximum likelihood. Explicit solution of the
maximum likelihood equations is not possible; their method
“requires a great deal of computer time.”

Another procedure for obtaining alternative estimators to the
KME has been proposed by Robertson and Uppuluri (1984).
Their procedure is based on a modification of the well-known
“redistribute to the right” algorithm (Efron, 1967). There do
not appear to be easily constructible models justifying most
redistribution schemes.

There would appear to be a need for some simple models,
and corresponding survival function estimators, applicable when
censoring carries an unfavorable prognosis for future survival.
In this paper we propose a model in which censoring can occur
only in a “high-risk” subpopulation. This model suggests a
modification of Efron's self-consistency algorithm which leads
to our Modified Kaplan-Meier estimator (MKME).


2. THE MODEL

Suppose that a lifetime “T” with survival function S(t) is the
object of investigation, that for each lifetime there is a binary
covariate “$\gamma$”, with

$Q(t) \equiv P(T > t \mid \gamma = 1) = (S(t))^m$

for some $m \ge 1$. It follows that

$R(t) \equiv P(T > t \mid \gamma = 0) = \frac{S(t) - p\,(S(t))^m}{1 - p}$,

where $p = P(\gamma = 1)$. Since $-dR(t)/dt$ must be non-negative,
the values of m and p are restricted by $mp \le 1$.

In this model, lifetimes with covariate $\gamma = 1$ have a hazard
function $\lambda_Q(\cdot) = m \lambda(\cdot)$, where $\lambda(\cdot)$ is the population hazard
function. Thus the covariate $\gamma$ divides the population into
“high-risk” and “low-risk” subpopulations.

Let $(T_1, \gamma_1), (T_2, \gamma_2), \ldots, (T_n, \gamma_n)$ be a sample of these
lifetimes and their corresponding covariates. Furthermore, let
$C_1, \ldots, C_n$ be a sample of “potential censoring times,”
independent of the corresponding T's. By this we mean that
observations are of the form

$X_i = (1 - \gamma_i)\,T_i + \gamma_i \min(T_i, C_i)$

and   (2.1)

$\delta_i = (1 - \gamma_i) + \gamma_i\, I(T_i \le C_i)$,

where I(·) is the indicator function. Thus censoring is only a
possibility in the high risk subpopulation.

Use of the KME under this model leads to overestimates of
S(t) (see §5). In the sequel we shall consider an alternative
estimator based on a modification of Efron's Self Consistency
Algorithm which is appropriate under this model.
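
The two conditional survival functions of the model can be checked with a few lines of Python. The sketch assumes Q = S^m for the high-risk group and R = (S - p S^m)/(1 - p) for the low-risk group, the latter being what the mixture identity S = pQ + (1 - p)R implies; it works directly with survival probabilities, so no particular S(t) is needed. The function names are illustrative.

```python
def high_risk_survival(s, m):
    """Q = S**m: survival probability in the high-risk group (gamma = 1)."""
    return s ** m


def low_risk_survival(s, m, p):
    """R = (S - p*S**m) / (1 - p): survival probability in the low-risk group."""
    if not (m >= 1 and 0.0 < p < 1.0 and m * p <= 1.0):
        raise ValueError("model requires m >= 1, 0 < p < 1 and m*p <= 1")
    return (s - p * s ** m) / (1.0 - p)


m, p = 2.0, 0.4
for s in (1.0, 0.75, 0.5, 0.25, 0.0):
    q = high_risk_survival(s, m)
    r = low_risk_survival(s, m, p)
    # Mixing the two groups with weights p and 1 - p recovers the population value s.
    assert abs(p * q + (1.0 - p) * r - s) < 1e-12
    print(s, q, round(r, 4))
```

The restriction mp ≤ 1 enforced in `low_risk_survival` is the condition under which R is non-increasing, as noted in the text.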
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3.  SURVIVAL  FUNCTION  ESTIMATOR 


Letting  0  =  XQ<Xj<X2<  -  <Xn  represent  the  ordered  times 
of  observation  and  '‘®pr«s€nt  the  corresponding 

values  of  6y  the  KME  is  tne  unique  limit  (as  K  -♦  oo)  of  the 
sequence  of  functions  obtained  by 


S(K+X)^^) 


1 

n 


)  +  i;(i 
'  1  =  1  ^ 


Thus  the  KME  satisfies 


Given  a  value  of  m,  the  parameter  c  can  be  estimated  by 


c(m)  = 


E  Sm(Xj) 


which is to say that the estimated probability of survival
beyond time “t” is the percentage of observations (censored or
uncensored) beyond time “t” plus the estimated percentage that
would have survived beyond time “t” but were censored before
“t”. The KME is said to be “self-consistent” because, in the
independent censoring model, for $X_j < t$,

P(T>,|X  =  ., . 


where $\gamma_{(j)}$ is the value of $\gamma$ corresponding to $X_j$, and $\hat S_m(\cdot)$ is
the MKME obtained using the given value of m. The
parameter p is estimated by $\hat p = \sum \gamma_i / n$, so that given a value of
c, and using (4.1), an estimate of m can be obtained as


m(c) 


2p 

1  -  2c(l  -  p) 


-  1  . 


It is easily verified that there exists uniquely a pair $(m_0, c_0)$
satisfying simultaneously $\hat c(m_0) = c_0$ and $\hat m(c_0) = m_0$. These
can be found by a variety of numerical methods, such as
repeated substitution. Since the model restrictions of §2 require
$mp \le 1$, we suggest the use of


Under  the  model  discussed  in  §2, 


fr>t|x  =  vi  =  o)=  |gi|  , 


suggesting  that  the  self-consistency  algorithnt  be  replaced  by 

tn. 


S 


The following facts regarding this sequence of survival function
estimators are proved by Link (1986):


1) For each fixed m, the sequence thus defined
converges. Furthermore, if the original estimator $S^{(0)}$ places no
weight on censored observations or beyond the range of
observations, the limit is unique. This will be called the
MKME, and denoted by $\hat S_m$.

2) For $m_1 < m_2$, $\hat S_{m_1}(t) \ge \hat S_{m_2}(t)$. Thus the
KME (obtained using m = 1) bounds these estimators from
above, while the empirical survival function of the X's
(obtained using large values of m) bounds them from below.


4. ESTIMATION OF MODEL PARAMETERS


In this section we shall show that if the covariate $\gamma$ is
observable, an estimate of the parameter m is available.

From the definition of X (2.1) we find that

|.(  S(X,  <  -  I  .  =  »)  = 


t  <  (0.1)  . 


Letting  c 


K^S(X)l7=()  w-e  hav*‘ 


$\hat m = \min\{ m_0, (\hat p)^{-1} \}$


as the estimator of the model parameter.


5. SIMULATION RESULTS


Rather than consider a specific survival function S(·), we
generated pairs $(U_T, \gamma)$ where $\gamma$ is a Bernoulli variable with
parameter p, and where $U_T$ satisfies


$P(1 - U_T \le x \mid \gamma = 1) = x^m$


and


$P(1 - U_T \le x \mid \gamma = 0) = \frac{x - p\,x^m}{1 - p}$,


0 < x < 1.

The $U_T$'s can be thought of as the quantiles of a random
sample from an arbitrary continuous survival function $S(\cdot)$.
Observations of the form $(U_X, \delta)$ were then obtained, where

$$U_X = (1 - \gamma)\,U_T + \gamma \min\{U_T, U_C\}$$

and

$$\delta = 1 - \gamma\, I(U_T > U_C),$$

and the potential censoring variable $U_C$, generated
independently of $U_T$ and $\gamma$, satisfies $P(U_C > x) = x^{\alpha}$,
$0 < x < 1$, where $\alpha > 0$ is a specified constant.

In  order  to  investigate  the  large  sample  behavior  of  the 
MKME,  a  sample  of  size  100(1  was  generated  using  p  -  .■>.  m 
-  2,  an<l  o  r-_  .(I2r).  yielding  an  expected  censoring  rate 
.2.'ML  Lhe  results  are  summarized  in  Table  1.  The  quantiles 
of the KME obtained for these data are also included. It is seen
that under this model the KME seriously overestimates true
survival probabilities.


Table 1. Estimates of quantiles; m = 2, p = .4, n = 1000

   t      kme     mkme   |    t      kme     mkme
  .05    .0444    .0470  |   .55    .4758    .5530
  .10    .0922    .1001  |   .60    .5211    .6003
  .15    .1362    .1510  |   .65    .5581    .6377
  .20    .1738    .1961  |   .70    .6155    .6930
  .25    .2226    .2572  |   .75    .6762    .7485
  .30    .2731    .3177  |   .80    .7299    .7950
  .35    .3127    .3656  |   .85    .7779    .8348
  .40    .3574    .4189  |   .90    .8612    .9002
  .45    .3834    .4494  |   .95    .9284    .9498
  .50    .4266    .4989  |

In addition, a limited Monte Carlo study was carried out to
investigate the sampling distribution of $\hat m$. The estimates
given in Table 2 were obtained by generating 100 samples of
size 100 in the manner described above. It appears that the
sampling distribution of $\hat m$ is skewed to the right and that $\hat m$
tends to slightly overestimate m.


Table 2. Estimates of Mean & Standard Deviation of $\hat m$ (n = 100)

    m       p     Censoring Rate    Mean    St. Dev.
   1.25    0.50       0.284         1.318    0.182
   1.50    0.50       0.247         1.564    0.204
   1.75    0.50       0.248         1.819    0.235
   2.00    0.75       0.117         2.223    0.498
   2.00    0.50       0.234         2.106    0.306
   3.00    0.70       0.116         3.414    0.730

6.  DISCUSSION 

The model considered in this article offers an alternative to
the usual independent censoring model. Censoring is a
possibility only in a subpopulation whose hazard function is m
times that of the population at large. The parameter m needs
only to be non-negative; values of m < 1 describe models in
which the event of censoring carries a favorable prognosis for
further survival.


If the covariable $\gamma$ is not observed, a (weak) upper bound on
the range of possible values of m can be obtained by noting
that

$$\frac{1}{P(\delta = 0)} \ \ge\ \frac{1}{P(\gamma = 1)} \ \ge\ m$$

and estimating $P(\delta = 0)$ in the obvious way.

The author wishes to thank Christine Ilunck, Nancy Coon,
Paul Ceissler, Thomas Mathew, and Kenneth Pollock for their
careful review and valuable editorial suggestions.

NONPARAMETRIC REGRESSION AND SPATIAL DATA:
SOME EXPERIENCES COLLABORATING WITH BIOLOGISTS.

Douglas  Nychka,  North  Carolina  State  University 


1  Introduction 

The widespread use by scientists of personal computers
for data collection and analysis gives the statistician more
opportunity to become closely involved in a research
project. This paper will describe two projects in which I
have collaborated with biologists. The main point is a
simple one. Computer resources can significantly improve a
collaborative relationship between a statistician and an
experimenter. This can happen in at least two ways: 1)
special purpose statistical software can be developed to
guide experimenters in analyzing their data, and 2) the
statistician can be involved in the data collection by helping
to develop the software used to collect and store the data.

This discussion will be organized by considering a
simple model (in the social science sense) that outlines the
interaction between the statistician and the scientist. The
next section briefly discusses the components of this model
and Sections 3 and 4 give specific examples from two
research projects. The first project concerns the estimation
of fitness surfaces for a song sparrow population based on
the sparrows' ability to survive over the winter. In this case
the fitness surface is the probability of a sparrow's survival
as a function of several body measurements. One success in
this project was making some specific nonparametric
regression software available to the biologist so he could
carry out most of the analysis of his data on his own. The
second project studies the spatial distribution of air plants
(epiphytes) in the canopy of Costa Rican rain forests. This
analysis depends on constructing a three-dimensional
"map" of the canopy trees and of the locations of epiphytes
using numerous sightings from a transit. I have participated
in the data collection by developing software for a PC that
estimates the xyz coordinates of points in the canopy from
the raw angular measurements. It is important to be able
to generate these tree maps right after a day of field work
because they serve as a check on the measurements and will
direct further subsampling of the trees' branches. Another
aspect of this project is to involve this botanist directly in
the spatial analysis of the canopy data. One way of
accomplishing this goal is to design a small set of macros
and compiled functions that make it possible to carry out
much of the analysis within the S statistical package.

Although these two projects are used as examples of
collaborative work, they both contain novel statistical
applications that are of interest in their own right. A reader
who is not particularly interested in collaborative aspects is
still encouraged to look at Sections 3 and 4 for their
statistical content.

2  Role  of  the  biologist  and  the  statistician 

Several different roles of the statistician and the
biologist are shown schematically in Figure 1. The
relationship shown in the first diagram is a model for short
term statistical consulting. The scientist both collects and
analyzes his/her data following the advice of a statistician.
Here the statistician plays a passive role and becomes
involved in the research project only through the scientist.
The next diagram is an improvement and indicates the
roles often filled by these two people in a collaborative
effort. The scientist concentrates on data collection while
the statistician is mainly involved in the statistical analysis
of the data. There are two potential disadvantages with
this separation. The statistician may not gain a true
appreciation for the data that have been collected, while the
analysis performed by the statistician may remain slightly
mysterious to the experimenter. An important contribution
of a statistician is in the design of experiments and this
would imply a link between the statistician and data
collection. Finally, it is also important for the scientist to be
involved in the analysis of the data. The last diagram has
added these two links and completes the possibilities for a
full collaboration. This paper will argue that it is possible to
foster these two nonstandard links in the final diagram
through the use of appropriate software on a personal
computer. The following sections give some specific
examples of how this was accomplished.

[Figure 1 diagrams, each relating DATA COLLECTION and DATA ANALYSIS
to the SCIENTIST and the STATISTICIAN: 1a Short Term Consulting;
1b Limited Collaboration; 1c Full Collaboration.]

Figure 1. Some different roles for a scientist and a
statistician in a research project.

3.1  The  overwinter  survival  of  Juvenile  song  sparrows 

This project is based on the research by Dolph Schluter
and James N.M. Smith, Department of Zoology,
University of British Columbia, on the song sparrow
population of an island off the coast of British Columbia
(Schluter and Smith 1986). The interest in this project was
to identify what characteristics in a song sparrow improve
its chances for surviving over the winter season. Before the
winter for the years 1974 - 1979 the juvenile song sparrow
population on Mandarte island was exhaustively sampled
by capturing the birds in mist nets. Six morphologic
measurements were made on each bird involving the size of
the body and beak. The same survey was done after the
winter and the birds not present on the island at that time
were recorded as nonsurvivors. One way of quantifying
survival is by a fitness function. If x are the six
morphological measurements made on a sparrow let p(x)
denote the probability that this individual will survive
through the winter. The functional form for p is not known
and thus these ecologists feel it is important to be able to
estimate p without having to assume a specific parametric
model. Note that this problem does not fit into the ordinary
nonparametric regression setting because the dependent
variable is a 0 or 1 response and the variance of this
response depends on p(x). Also, there are obvious
constraints on p: $0 \le p(x) \le 1$.

When I was first contacted by Dolph Schluter he
already had an interest in analyzing these fitness data using
nonparametric methods. One possible way of estimating
the fitness function is by a penalized likelihood approach
and the details of a spline method are described in the next
subsection. The immediate problem was finding software
that would run on his IBM AT. If the fitness function only
depends on one independent variable then it is possible to
compute p using a modest-size FORTRAN program. Dolph
Schluter was able to use the program that I wrote to
investigate the effect of the morphologic variables
separately. Also, by having the numerical portion
available, he was able to spend time on a user friendly shell
to call these numerical routines. The software resulting
from our combined efforts was not only statistically sound
but could also be used by other ecologists with minimal
introduction. The use of these nonparametric methods for
fitness data has been subsequently described in Schluter
(1988).

Figure 2 is an example of this nonparametric method
for estimating survival. Plotted are the responses (0,1) for
151 juvenile male song sparrows against the standardized
second principal component of the morphologic
measurements. The smooth curves are the estimated
probabilities of survival for different amounts of smoothing.
The solid curve in this group is the spline estimate where
the amount of smoothing was determined objectively from
the data by cross validation.

Estimating a fitness surface is more complicated mainly
because of the multivariate nature of the problem. One
possible solution is to approximate the fitness surface using
the representation from projection pursuit regression
(Friedman and Stuetzle, 1981). The key to this
representation is the identification of linear combinations of
the original measurements that give a better explanation of
a song sparrow's survival. This approach is appropriate
because it is reasonable to expect survival to depend on a
characteristic that is a combination of the morphological
measurements. Let $f(x) = \ln( p(x) / (1 - p(x)) )$ be the logit of
p(x). One nonparametric representation for f is

$$f(x) = \sum_{j=1}^{J} g_j(a_j^{T} x) \tag{3.1}$$

where the vectors of coefficients are chosen so that
$\|a_j\| = 1$. In this model one must not only estimate the
ridge functions $g_j$ but also the projections, $a_j$. Although this
adds more structure to the statistical method, there are
computational advantages because one only needs to
consider nonparametric estimates of curves rather than
an estimate of a surface.

Figure 3 gives some results for the male juvenile
sparrows for two ridge functions (J = 2). The amount of
smoothing used in this estimate is the same as that used for
the solid curve in Figure 2 ($\ln(\lambda) \approx -4$). Plotted are the
probability contours of the fitness surface as a function of
the two variables obtained from the two estimated
projections, $\hat a_1$ and $\hat a_2$. The shape of this surface is a saddle
where the first variable has a stabilizing influence on
survival while the second is disruptive. One interesting
feature of this estimate is that the first projection yields a
linear combination of the original measurements that is
similar to the second principal component. This second
principal component can be identified with the relative size
between the body and head.

Projection  pursuit  estimation  especially  in  the  context 
of  generalized  additive  models  is  computationally  intensive. 
The  estimate  plotted  in  Figure  3  took  approximately  one 
half  an  hour  on  a  VAX  750.  However  it  is  a  mistake  to 
retreat  from  a  method  that  is  just  beyond  the  power  of  a 
PC.  Given  the  current  trend  for  more  powerful  personal 
computers,  it  is  largely  a  matter  of  time  before  these 
methods  will  be  feasible.  Also,  the  amount  of  time  it  takes 
to  perform  a  statistical  analysis  is  often  evaluated  on  the 
wrong  scale.  The  fitness  data  described  in  this  section  was 
accumulated  over  the  course  of  five  years  at  a  remote  site. 
Even  if  a  statistical  analysis  takes  several  days  on  a  PC  this 
is a modest amount of time compared to the effort spent on
collecting  these  data. 

3.2  A nonparametric estimate of the fitness curve using a
smoothing spline

First assume that only one morphological measurement
is taken and let $(y_k, x_k)$, $1 \le k \le n$, be the observed data
where $y_k = 0$ indicates nonsurvival and $y_k = 1$
corresponds to survival of the juvenile sparrow. The log
likelihood for these data is proportional to

$$\sum_{k=1}^{n} \left[ \ln(p(x_k))\, y_k + \ln(1 - p(x_k))\,(1 - y_k) \right]$$

Now let $\phi(u) = e^{u}/(1 + e^{u})$ be the logistic link function and
let $f(x) = \ln( p(x) / (1 - p(x)) )$. With this parametrization
$p(x) = \phi(f(x))$, and it has the advantage that the range of f
does not have any constraints. The log likelihood now has
the form:

$$\sum_{k=1}^{n} \left[ f(x_k)\, y_k + \psi(f(x_k)) \right]$$

with $\psi(u) = \ln( 1/(1 + e^{u}) )$.
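
The equivalence of the two forms of the log likelihood can be checked numerically. The following is a minimal sketch in Python with NumPy (not the FORTRAN used in this project); the function names and the small data vectors are illustrative assumptions only.

```python
import numpy as np

def logistic(u):
    """Logistic link phi(u) = e^u / (1 + e^u)."""
    return 1.0 / (1.0 + np.exp(-u))

def loglik_p(p, y):
    """Log likelihood written in terms of the survival probabilities p(x_k)."""
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def loglik_f(f, y):
    """Log likelihood written in terms of the logit f(x_k)."""
    # psi(u) = ln(1 / (1 + e^u)) = -ln(1 + e^u), computed stably
    psi = -np.logaddexp(0.0, f)
    return np.sum(f * y + psi)

# Made-up responses and logit values, for illustration only
y = np.array([0.0, 1.0, 1.0, 0.0, 1.0])
f = np.array([-1.2, 0.3, 2.0, -0.5, 0.8])
p = logistic(f)

print(loglik_p(p, y), loglik_f(f, y))  # the two values agree
```

Any real-valued f gives a probability strictly between 0 and 1, which is the point of working on the logit scale.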

Now if this expression were maximized over all functions, f,
the solution would degenerate to a function where $f(x_k) = \infty$
if $y_k = 1$ and $f(x_k) = -\infty$ when $y_k = 0$. Clearly this is not a
suitable estimate. One reason that result is not appropriate
is because one expects some continuity or smoothness
between values of f (and p)
for similar values of x. This assumption implies that the
survival of a sparrow is a continuous function of the
morphological measurements. One way of incorporating this
information is to penalize the likelihood when f is rough.
For example, $\int f''(u)^2\,du$ is one overall summary of the
curvature of f and will be large when f is very wiggly and
small as f becomes linear. In this work, the estimate of f is
found by maximizing

$$\sum_{k=1}^{n} \left[ f(x_k)\, y_k + \psi(f(x_k)) \right] - \lambda \int f''(u)^2\,du$$

over all functions where $\int f''(u)^2\,du < \infty$. Note that
although it is the logit, f, that is being estimated directly,
the estimated survival probabilities are found by the
transformation $\hat p(x) = \phi(\hat f(x))$.

Surprisingly this estimate is not difficult to compute
and has the same form as an ordinary cubic smoothing
spline. That is, the estimated curve has two continuous
derivatives and can be expressed as a piecewise cubic
polynomial with join points at $\{x_k\}$, $1 \le k \le n$ (see Eubank,
1988 for an introduction to this subject). Although this
maximum must be found using an iterative procedure, each
iteration is efficient because it only requires the smoothing
of a set of pseudo-observations using a weighted, cubic
smoothing spline. This algorithm is very similar to the
iteratively reweighted least squares approach used to
estimate the parameters in generalized linear models.
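
The iteration can be illustrated with a simplified stand-in. The sketch below (Python with NumPy) is not the spline routine used in this project: it assumes equally spaced, sorted x values and replaces the integrated squared second derivative by a sum of squared second differences of the values of f at the data points. Each pass solves a weighted, penalized least squares problem for the pseudo-observations, which is the reweighting idea described above. The function names and the test data are hypothetical.

```python
import numpy as np

def second_difference_matrix(n):
    """(n-2) x n matrix D with (D f)_i = f_i - 2 f_{i+1} + f_{i+2}."""
    D = np.zeros((n - 2, n))
    for i in range(n - 2):
        D[i, i:i + 3] = [1.0, -2.0, 1.0]
    return D

def fit_penalized_logit(y, lam, n_iter=100, tol=1e-10):
    """Maximize sum(f*y - ln(1 + e^f)) - lam * ||D f||^2 over the vector f.

    Assumes the responses are ordered by x and the x values are equally
    spaced; this is a discrete analogue of the smoothing spline problem.
    """
    y = np.asarray(y, dtype=float)
    n = len(y)
    D = second_difference_matrix(n)
    K = D.T @ D
    f = np.zeros(n)
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-f))
        w = p * (1.0 - p)                 # current weights
        # W z, where z = f + (y - p) / w are the pseudo-observations
        rhs = w * f + (y - p)
        # Weighted penalized least squares step: (W + 2 lam K) f = W z
        f_new = np.linalg.solve(np.diag(w) + 2.0 * lam * K, rhs)
        converged = np.max(np.abs(f_new - f)) < tol
        f = f_new
        if converged:
            break
    return f

# Deterministic 0/1 responses whose frequency of 1's increases with x
x = np.linspace(0.0, 1.0, 40)
y = ((np.arange(40) * 7) % 10 < 2 + 6 * x).astype(float)

f_rough = fit_penalized_logit(y, lam=1.0)    # small penalty: wiggly fit
f_smooth = fit_penalized_logit(y, lam=1e6)   # large penalty: nearly linear logit
p_hat = 1.0 / (1.0 + np.exp(-f_rough))
```

With the large penalty the fitted logit is close to a straight line in x, while the small penalty follows the data more closely.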

The smoothing parameter $\lambda$ controls the relative
weight given to the roughness penalty and the log
likelihood. Note that when $\lambda$ is very small the estimate
will fit the data well but may not be very smooth. At the
other extreme when $\lambda$ is very large the estimate will
approach a straight line where the slope and intercept are
the usual maximum likelihood estimates. The effect of
varying $\lambda$ is shown in Figure 2.

So far the discussion has focused on computing a spline
estimate for a fixed value of $\lambda$. Because the estimated curve
is sensitive to the choice of this parameter, it is important
to be able to estimate an appropriate value objectively from
the data. One way to accomplish this is by cross validation.
Let $\hat p_{\lambda,k}$ denote the spline estimate of p
for a particular value of the smoothing parameter having
omitted the kth observation, $(y_k, x_k)$. If this value for $\lambda$ is
a good one then "on the average" $\hat p_{\lambda,k}(x_k)$ should be
"close" to the omitted observation, $y_k$. This correspondence
can be quantified by the cross validation function:

$$V_{cv}(\lambda) = \sum_{k=1}^{n} \frac{\left( y_k - \hat p_{\lambda,k}(x_k) \right)^2}{\left( 1 - \hat p_{\lambda,k}(x_k) \right) \hat p_{\lambda,k}(x_k)}$$

where the denominator in this sum of squares adjusts for
the different variances. With this criterion a data-based
estimate of the smoothing parameter is the value that
minimizes $V_{cv}$ (see Yandell, et al., 1984 for more details).
At first glance it may appear that $V_{cv}$ will be very
expensive to compute. However, by considering a linear
approximation based on the estimate of p for the full data
set and some efficient routines for cubic splines (Hutchinson
and de Hoog, 1985) this cross validation function can be
computed easily on an IBM PC.
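
The cross validation function itself can be written down directly by brute force, refitting once for each omitted observation. The sketch below (Python with NumPy) does this for an arbitrary smoother. It does not use the linear approximation mentioned above, and the Gaussian kernel average is only a stand-in for the spline estimate, with the bandwidth playing the role of the smoothing parameter. The names `cv_score` and `kernel_average`, the clipping constant, and the data are assumptions made for illustration.

```python
import numpy as np

def cv_score(x, y, fit_predict, lam, eps=1e-6):
    """Leave-one-out cross validation score with variance weighting.

    fit_predict(x_train, y_train, x0, lam) must return the estimated
    probability at x0 from a fit that did not see the omitted point.
    """
    n = len(y)
    total = 0.0
    for k in range(n):
        keep = np.arange(n) != k
        p_k = fit_predict(x[keep], y[keep], x[k], lam)
        p_k = min(max(p_k, eps), 1.0 - eps)   # keep the denominator positive
        total += (y[k] - p_k) ** 2 / ((1.0 - p_k) * p_k)
    return total

def kernel_average(x_train, y_train, x0, bandwidth):
    """Stand-in smoother: Gaussian kernel average (not the spline estimate)."""
    w = np.exp(-0.5 * ((x_train - x0) / bandwidth) ** 2)
    return float(np.sum(w * y_train) / np.sum(w))

# Deterministic 0/1 responses whose frequency of 1's increases with x
x = np.linspace(0.0, 1.0, 30)
y = ((np.arange(30) * 7) % 10 < 2 + 6 * x).astype(float)

bandwidths = [0.05, 0.1, 0.2, 0.5]
scores = [cv_score(x, y, kernel_average, h) for h in bandwidths]
best = bandwidths[int(np.argmin(scores))]
print(dict(zip(bandwidths, scores)), best)
```

The brute-force loop costs n refits per candidate value, which is why an approximation of the kind cited above matters on a small machine.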

3.3  Projection  Pursuit  Estimation 

In the same manner as the univariate spline
estimate, the projection pursuit estimate of the fitness
surface will be defined as the maximizer of a penalized
likelihood. Recall that from (3.1) $f(x_k)$ will depend on J
pairs of univariate functions and projections. For a fixed
set of projections and for some $\lambda > 0$ the estimates of $g_j$ are
taken to be the functions that maximize:

$$\sum_{k=1}^{n} \left[ f(x_k)\, y_k + \psi(f(x_k)) \right] - \lambda \sum_{j=1}^{J} \int \left( g_j''(u) \right)^2 du \tag{3.3}$$

such that $\int ( g_j''(u) )^2\,du < \infty$, $1 \le j \le J$.

In this way any set of projections will determine an
estimate for the surface. In ordinary projection pursuit
regression, one chooses a set of projection vectors by
minimizing the residual sum of squares. In this case, it is
natural to consider a weighted sum of squared residuals
(which will also be close to the deviance). Let $\hat p(x)$ denote
the probability surface corresponding to the estimated logit
function for a fixed set of projections. Weighting residuals
by the estimated standard deviation of $y_k$ gives

$$R = \sum_{k=1}^{n} \frac{\left( y_k - \hat p(x_k) \right)^2}{\left( 1 - \hat p(x_k) \right) \hat p(x_k)} \tag{3.4}$$

The estimates of $a_1, \ldots, a_J$ are given by those vectors
that minimize R. An outline of the algorithm used to
perform this minimization is:


Initialization: $g_j = 0$,

Repeat until convergence:

    Do l = 1, J

        Fix $a_j$ for $j \ne l$ and minimize R with respect to $a_l$
        (coarse search on sphere refined using the simplex method)

Note  that  if  only  one  projection  is  allowed  to  vary  then  the 
maximization  to  estimate  gj  is  just  the  one-dimensional 
spline  smoothing  problem  described  in  the  previous  section. 

If  this  algorithm  converges  then  the  limit  will  be  a  solution 
to  the  maximization/minimization  problems  stated  above. 
When this algorithm will converge is still an open question.

In retrospect the penalized likelihood suggests a more
direct method for estimating the projections. Since the
penalized likelihood in (3.3) depends both on the ridge
functions and the projections it is reasonable to maximize
this functional jointly over these two components. It is
possible that this unified approach will be better than
estimating the projections based on residuals. Ordinarily
the minimization over the projections does not account for
the smoothness of the implied ridge functions. As the
dimension of x increases it becomes easier to find a
projection that fits the data well. The limited experience
from this survival data is that the estimated projections
may yield rough, possibly spurious estimates. This may be
due to the fact that the minimization of R does not make
any adjustment for projections that give very rough
estimates of the ridge functions but nevertheless fit the
data well. By considering the projections that maximize the
penalized likelihood, the roughness penalty may help to
control this effect.

4.1  Spatial  distribution  of  epiphytes  in  the  tropical  canopy 

One important difference between tropical and
temperate forests is where nutrients are stored. Although in
a temperate forest most of the nutrients are found in the
soil, in a tropical rain forest a significant portion of the
nutrients are stored as biomass. One important component
of this biomass is the epiphytes and dead organic matter
in the tree canopy. Nalini Nadkarni at the Biology
Department at the University of California, Santa Barbara
is interested in studying the role that the canopy plays in
nutrient cycling. This research has practical implications
because as rain forest is cleared for agricultural use the
canopy is destroyed and thus the normal nutrient cycle is
interrupted.

A first phase of this project is to quantify the
architecture of trees that make up the canopy and to
determine how different epiphytes are distributed
throughout this region of the forest. Until recently, because
of its height, the canopy was inaccessible to researchers.
However, by using mountain climbing equipment it is now
possible to reach the canopy by ropes and move safely within it.
The observational data can be thought of as a three-dimensional
map giving the spatial locations of branches
and other features within the canopy. With this type of
data one can then look for patterns in the epiphytic
distribution and test for preferential sites or for competition
among different species. At a more fundamental level one
can study how nutrients percolate down from the top of the
canopy to the forest floor.

My collaboration in this project started with designing
a method to collect canopy data. The problem is to
determine the three dimensional coordinates of features in a
tree without having to climb to each location of interest.
Also once a method has been developed, it is important to
quantify its error. Our final solution was to use the parallax
view provided by a transit at two locations. Figure 4 is a
diagram of the geometry. (The mathematical details are
given in the next section.) To find the coordinates of some
target in a tree, a transit is used to find the horizontal and
vertical angles from two vantage points that are separated
by a short distance (approximately three meters). The
target's position is estimated by the midpoint of the
shortest line segment that connects the lines of sight from
the two transits. An estimate of error is the length of this
segment (see Figure 4). Transforming the angular
measurements at two vantage points to the xyz
coordinates is too complicated to do by hand but makes for
a short program on a PC. Figure 5 is a draftsman's view of
a tree mapped by this procedure including the locations of
two kinds of epiphytes.

Besides working out the geometry, part of my role was
to provide the field assistant with software to compute the
tree map coordinates at the end of a day of taking
sightings. In effect I have some participation in how these
data are collected since these programs can incorporate
logic to spot inconsistencies or flag estimated locations that
have large estimated errors. The most frustrating situation
is when a bad observation is identified only after returning
from Costa Rica! One use for these tree maps is to aid in
subsampling a tree crown for the detailed investigation of
specific branches. This is another area in which I can be
involved in data collection. PC-based software can be used
to guide the choice of subsamples in a manner to ensure a
good experimental design.

Another aspect of our collaboration is having Dr.
Nadkarni (and her research assistants) participate in the
spatial analysis of the epiphyte locations. Unlike the project
in the previous section little new software is needed. Rather
one needs to integrate a few special purpose functions into
an existing statistical package. For example, the
draftsman's view of a tree in Figure 5 was drawn using a
specially written macro in S. The advantage is that most of
the scientist's effort will be spent learning a standard
package. This is better than having to deal with a special
(and perhaps idiosyncratic) program that only performs a
specific analysis of the data. Besides exploratory graphics,
testing hypotheses about the spatial distributions of
epiphytes can also be based in S. To do this one needs an
additional S function that simulates the distribution of
epiphytes on the tree network according to some null
hypothesis. For example, suppose one wanted to test
whether the epiphytes were uniformly distributed on the
branches of a tree. One selects a test statistic that
measures uniformity and calculates the value of this
statistic using the coordinates of the observed data. In
order to calculate the reference distribution for this statistic
one simulates samples whose coordinates are uniformly
distributed on the tree network and for each of these
simulated samples the same test statistic is computed. By
generating a large number of samples (several hundred)
one can estimate the distribution of the test statistic under
the hypothesis that the epiphyte positions are uniformly
distributed. To do a hypothesis test, one compares the
observed value of the test statistic with the distribution
determined from these simulations.
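
The simulation test just described can be sketched as follows. This is an illustration in Python with NumPy, not the S function used in the project. It assumes the tree is represented as a set of straight segments, takes the mean nearest-neighbor distance as the test statistic (one possible measure of uniformity, chosen here only for illustration), and uses a made-up three-segment "tree" and a deliberately clustered set of observed points.

```python
import numpy as np

def sample_on_segments(starts, ends, n_points, rng):
    """Draw points uniformly (by length) on a network of straight segments."""
    lengths = np.linalg.norm(ends - starts, axis=1)
    idx = rng.choice(len(lengths), size=n_points, p=lengths / lengths.sum())
    t = rng.uniform(size=(n_points, 1))
    return starts[idx] + t * (ends[idx] - starts[idx])

def mean_nearest_neighbor(points):
    """Test statistic: average distance from each point to its nearest neighbor."""
    diff = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)
    return dist.min(axis=1).mean()

def monte_carlo_test(observed, starts, ends, n_sim=500, seed=0):
    """Compare the observed statistic with its simulated null distribution."""
    rng = np.random.default_rng(seed)
    t_obs = mean_nearest_neighbor(observed)
    t_sim = np.array([
        mean_nearest_neighbor(sample_on_segments(starts, ends, len(observed), rng))
        for _ in range(n_sim)
    ])
    # One-sided p-value: small nearest-neighbor distances indicate clustering
    p_value = (1 + np.sum(t_sim <= t_obs)) / (n_sim + 1)
    return t_obs, t_sim, p_value

# Hypothetical tree: a trunk and two branches (coordinates in meters)
starts = np.array([[0.0, 0.0, 0.0], [0.0, 0.0, 5.0], [0.0, 0.0, 5.0]])
ends = np.array([[0.0, 0.0, 5.0], [3.0, 0.0, 8.0], [-2.0, 2.0, 7.0]])

# Hypothetical observed positions, tightly clustered near the base of the trunk
observed = starts[0] + np.linspace(0.0, 0.01, 20)[:, None] * (ends[0] - starts[0])

t_obs, t_sim, p_value = monte_carlo_test(observed, starts, ends)
print(t_obs, t_sim.mean(), p_value)
```

Several hundred simulated samples, as suggested above, are enough to resolve p-values at the usual significance levels; other statistics can be substituted for `mean_nearest_neighbor` without changing the rest.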


4.2  Tree  mapping. 

In this section a derivation is given for estimating the
coordinates of a target from the direction cosines measured
at two vantage points. To simplify this discussion it will be
assumed that the origin is at the center of the first transit
while the coordinates of the second transit are d = (D, 0, h).
(D is the horizontal distance between transits while h is the
difference in elevation.) The horizontal and vertical angles
measured by the transit are taken to have the same sense
as $\theta$ and $\phi$ in a spherical coordinate system. Thus, if $\theta$ and
$\phi$ are the pair of angles measured from the transit to the
target then a unit vector, e, in this direction has
components:

$$e = (\cos\theta \sin\phi,\ \sin\theta \sin\phi,\ \cos\phi).$$

Let a and b denote the directions to a particular target
from the two vantage points of a transit. The rays
representing the lines of sight can be parametrized by $\alpha a$
and $d + \beta b$. One estimate of the target position is the
midpoint of the shortest line segment joining these two
rays. To find this point let $\hat\alpha$ and $\hat\beta$ be the values that
minimize:

$$\| \alpha a - (d + \beta b) \|^2 \tag{4.1}$$

Setting first partial derivatives equal to zero yields the
system of equations

$$\alpha - \beta\gamma = u_1$$

$$-\alpha\gamma + \beta = -u_2$$

where $\gamma = a^{T} b$, $u_1 = a^{T} d$ and $u_2 = b^{T} d$.

Thus

$$\hat\alpha = \frac{u_1 - \gamma u_2}{1 - \gamma^2}, \qquad \hat\beta = \frac{\gamma u_1 - u_2}{1 - \gamma^2}$$

and the estimated target position is:

$$\left[ \hat\alpha a + (d + \hat\beta b) \right] / 2 .$$

The squared length of the line segment is found by
substituting $\hat\alpha$ and $\hat\beta$ into (4.1).
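
The computation is short enough to sketch in full. The following Python/NumPy version is an illustration under the conventions of this section (first transit at the origin, second transit at d, unit direction vectors a and b); it is not the PC program used in the field, and the function names and the example target are hypothetical.

```python
import numpy as np

def direction(theta, phi):
    """Unit vector for horizontal angle theta and vertical angle phi
    (spherical coordinate convention, phi measured from the z axis)."""
    return np.array([np.cos(theta) * np.sin(phi),
                     np.sin(theta) * np.sin(phi),
                     np.cos(phi)])

def locate_target(a, b, d):
    """Midpoint of the shortest segment joining the lines alpha*a and d + beta*b.

    Returns the estimated position and the length of that segment.
    a and b must be unit vectors; d is the position of the second transit.
    """
    gamma = a @ b
    u1 = a @ d
    u2 = b @ d
    denom = 1.0 - gamma ** 2
    if denom < 1e-12:
        raise ValueError("lines of sight are (nearly) parallel")
    alpha = (u1 - gamma * u2) / denom
    beta = (gamma * u1 - u2) / denom
    p1 = alpha * a            # closest point on the first line of sight
    p2 = d + beta * b         # closest point on the second line of sight
    return 0.5 * (p1 + p2), np.linalg.norm(p1 - p2)

# Hypothetical example: error-free sightings of a known target
target = np.array([2.0, 5.0, 12.0])
d = np.array([3.0, 0.0, 0.2])             # second transit: D = 3 m, h = 0.2 m

theta = np.arctan2(target[1], target[0])
phi = np.arccos(target[2] / np.linalg.norm(target))
a = direction(theta, phi)                 # direction from the first transit
b = (target - d) / np.linalg.norm(target - d)   # direction from the second

position, gap = locate_target(a, b, d)
print(position, gap)
```

With error-free angles the two lines of sight intersect, so the estimated position reproduces the target and the segment length is zero up to rounding. With measurement error the segment length becomes positive and can be used to flag doubtful sightings.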

As  a  final  note,  care  should  be  taken  in  interpreting  the 
line  segment  length  as  an  absolute  mea.surc  of  the  estimated 
position’s  accuracy.  It  is  a  biased  estimate  of  distance 
between  the  estimated  position  and  the  actual  one. 
.Simluations  indicale  that  in  the  situations  encountered  in 
mapping  tree  positions  the  median  segment  length  is 
typically  about  2/3  the  actual  distance. 
Figure 2. Over winter survival of male juvenile song
sparrows as a function of the standardized second
principal component. Plotted points are the actual
survival of 151 birds (0 = nonsurvival, 1 = survival).
The curves are the estimated probabilities for survival
for different amounts of smoothing. In the solid curve
the smoothing parameter has been chosen objectively
from the data using cross validation.
Figure 3. Over winter survival surface estimated using
projection pursuit as a function of two projections of the
morphologic measurements. Plotted are the probability
contours for a surface consisting of two ridge functions.
The first projection (x axis) is very similar to the second
principal component described in Figure 2.
Figure 4. Geometry of measuring position from the angular measurements of two transits
Figure 5. Tree map. Draftsman view (top and two side views at right angles) of a pasture tree
from a study site in Costa Rica. The scale is in meters and the plotted symbols indicate the location
of bromeliads (B D) and mistletoes (M).
SPACE BALLS!

OR

ESTIMATING THE DIAMETER DISTRIBUTION OF MONOSIZE POLYSTYRENE

MICROSPHERES

Susannah  B.  Schiller,  National  Bureau  of  Standards 


Introduction. 

Polystyrene  microspheres,  with  nominal  diameters  in 
the  range  of  0.3  to  30  microns,  have  been  certified  by  the 
National  Bureau  of  Standards  as  Standard  Reference 
Materials.  Some  of  them  were  manufactured  in  space  on 
the  shuttle  Challenger  because  the  beads  are  more  uniform 
in  size  and  shape  when  made  in  zero  gravity.  They  provide 
an  important  tool  for  calibrating  instruments  that  are  used 
to  examine  very  small  particles,  such  as  blood  cells, 
bacteria,  or  airborne  dust.  In  order  to  be  useful  for 
calibration,  their  diameters  must  be  well-characterized. 

To  certify  these  SRMs,  the  beads  were  put  into  a 
suspension  which,  when  dried,  caused  the  beads  to  form 
chains  on  a  microscope  slide.  Parallel  light,  projected  up 
through  the  slide,  marked  the  center  of  each  sphere  in  the 
common  back-focal  plane  on  which  the  microscope  was 
focused.  Because  the  "chained"  spheres  touched,  the 
distance  between  sphere  centers  on  photomicrographs 
gave  a  good  estimate  of  sphere  diameter  [1].  In  order  to 
get  the  desired  accuracy  for  certification,  the  scientists  had 
to  make  careful  and  tedious  measurements  on  thousands  of 
pairs  of  spheres.  For  the  users  of  these  SRMs,  who  only 
want  to  verify  that  their  mean  measurements  fall  within  the 
certified  bounds  of  uncertainty,  a  quicker  approach  is 
desirable. 

The  proposed  technique  uses  closely  packed  hexagonal 
arrays  of  the  microspheres  instead  of  chains.  Row  lengths 
are  measured  between  the  centers  of  the  end  spheres.  The 
obvious  diameter  estimate  is  the  average  center-to-center 
distance,  found  by  dividing  the  row  length  by  the  number  of 
spheres  in  the  row  minus  one.  However,  because  the 
diameters  are  not  identical,  there  are  always  air  gaps  in 
these  arrays  which  inflate  the  diameter  estimates.  These 
air  gaps  cannot  be  measured  via  the  center  distance  finding 
technique,  nor  have  they  been  modelled  mathematically. 
Additionally,  there  is  the  problem  of  the  "scrunching  factor," 
or  Van  der  Waals'  attractions.  When  two  objects  touch, 
they  flatten  by  some  factor  which,  in  the  case  of 
polystyrene  microspheres  of  the  size  under  consideration 
here, is about 0.1%. This factor was easily taken into
account  for  the  pairwise  measurements  of  microspheres 
arranged  in  chains,  where  it  was  known  that  every  pair  of 
spheres  touched.  However,  in  an  array  where  many  pairs 
of  spheres  do  not  touch,  the  analysis  is  much  more  difficult. 

Simulation. 

The  approach  taken  to  this  estimation  problem  was  to 
simulate  packed  arrays  of  spheres  and  determine  the 
behavior of the air gaps. From the chain measurements, it
was known that the diameters followed a normal
distribution, and that the standard deviations were roughly
1% of the mean diameter. To simulate this, arrays of circles
whose diameters came from the normal distribution
$N(\mu_D, \sigma^2)$ were generated. These were "packed" by
minimizing the sum of squared distances between centers
of neighboring circles subject to the following constraint: if
packing  caused  a  pair  to  touch,  they  were  forced  to  overlap 
by  exactly  0.1%  of  the  average  of  the  two  diameters 
involved.  Otherwise,  an  air  gap  was  left  whenever  the 
centers  were  more  than  the  average  of  the  two  diameters 
apart. 

Multiple simulations were performed for each of the
following combinations of N, $\mu_D$ and $\sigma$, where N is the
number of circles in an array:

N = 81   $\mu_D$ = 1     $\sigma$ = 0.009 to 0.015 by 0.001
N = 64   $\mu_D$ = 1     $\sigma$ = 0.008 to 0.015 by 0.001
N = 64   $\mu_D$ = 1.5   $\sigma$ = 0.008$\mu_D$ to 0.015$\mu_D$ by 0.001$\mu_D$
N = 64   $\mu_D$ = 2     $\sigma$ = 0.009$\mu_D$ to 0.012$\mu_D$ by 0.001$\mu_D$


Each array was laid out in a "square" fashion, with $\sqrt{N} = K$
columns and K circles per column (Figure 1).

Figure 1. Simulated Hexagonal Array (N = 81, $\mu$ = 1, $\sigma$ = 0.01)
The arrays were packed by minimizing the sum (over the
entire array) of the squared gaps between neighboring
circles, subject to not allowing the circles to overlap, using
the routine E04VCF in NAG. Larger arrays caused this
routine to fail, and smaller arrays were deemed too small to
be useful. Results were output in the form of center
coordinates and a diameter for each circle in the array. The
distance between each neighboring pair of circles was
found, and row lengths R (between the centers of the outer
circles of each row) were measured in all three possible
directions. Only rows with a fixed number of circles, K,
were considered for either the row lengths or pairwise
distances. Average center-to-center distances were
computed from the row lengths:

$C_j = R_j / (K - 1)$

and were averaged for each array.

Gap  Frequency  and  Size  Distribution. 

A  colleague  has  conjectured  that,  assuming  the 
variability  among  diameters  is  "small,"  the  minimum 
proportion  of  gaps  possible  in  an  array  of  circles  is  the 
number of interior circles divided by the total number of
neighboring pairs. For the case of the "square" array, this is:

$p_m = \frac{(K-2)^2}{(K-1)(3K-1)}$   (1)

This means that the minimum percent of gaps for the
simulated arrays should be:

K    Percent Gaps
8    22.36%
9    23.56%

The  conjecture  serves  as  a  useful  guide.  For  all  relative 
standard deviations considered, an air gap existed between
a  pair  of  circles  about  25%  of  the  time  in  the  simulated 
arrays.  It  might  be  noted,  however,  that  this  minimum 
proportion  can  be  greater  than  25%  for  larger  arrays.  For 
example,  a  14  x  14  array  should  have  gaps  between 
neighboring  circles  at  least  27%  of  the  time,  based  on  the 
conjecture. 
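The conjectured lower bound is easy to tabulate. The short sketch below evaluates $(K-2)^2 / ((K-1)(3K-1))$ for the array sizes mentioned in the text; the function name `min_gap_proportion` is hypothetical.

```python
def min_gap_proportion(k):
    """Conjectured minimum proportion of gaps in a K-by-K 'square' array."""
    interior_circles = (k - 2) ** 2
    neighboring_pairs = (k - 1) * (3 * k - 1)
    return interior_circles / neighboring_pairs

for k in (8, 9, 14):
    print(k, f"{100 * min_gap_proportion(k):.2f}%")
# prints 22.36%, 23.56% and 27.02% for K = 8, 9 and 14
```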

The  gaps  appear  to  follow  a  gamma  distribution. 
Histograms  of  the  gaps  were  overlaid  with  plots  of  gamma 
probability  density  functions  having  the  same  mean  and 
variance,  and  the  fit  was  remarkably  good  (Figure  2).  To 
date,  no  theoretical  reason  has  been  determined  for  this 
occurrence. 
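Matching a gamma density to a histogram "having the same mean and variance" amounts to a method-of-moments fit. A minimal sketch, assuming the shape/scale parametrization (mean = shape * scale, variance = shape * scale^2); the function names and the gap values in the usage lines are hypothetical, not simulation output.

```python
import math

def gamma_from_moments(mean, variance):
    """Shape and scale of the gamma distribution with the given mean and variance."""
    shape = mean ** 2 / variance
    scale = variance / mean
    return shape, scale

def gamma_pdf(x, shape, scale):
    """Gamma probability density at x, computed on the log scale for stability."""
    if x <= 0:
        return 0.0
    log_pdf = ((shape - 1) * math.log(x) - x / scale
               - math.lgamma(shape) - shape * math.log(scale))
    return math.exp(log_pdf)

# Usage with made-up gap sizes (illustrative only):
gaps = [0.004, 0.011, 0.019, 0.007, 0.026, 0.013]
m = sum(gaps) / len(gaps)
v = sum((g - m) ** 2 for g in gaps) / (len(gaps) - 1)
shape, scale = gamma_from_moments(m, v)
```

The fitted density `gamma_pdf(x, shape, scale)` can then be overlaid on a histogram of the gaps.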


Figure 2. Fit of Gamma Pdf to Histogram of Gaps (panels: $\sigma$ = 0.009 and $\sigma$ = 0.01)


The  mean  and  standard  deviation  of  the  gaps  depend 
upon  the  standard  deviations  of  the  circle  diameters,  but 
these  statistics  are  quite  variable  between  simulated 
arrays.  This  is  to  be  expected,  because  gap  sizes  depend 
on  the  overall  layout  of  the  array,  not  just  on  the  two 
diameters  on  either  side.  For  example,  if  the  same  N  balls 
were  arranged  differently  in  the  square  array,  the 
optimization  would  produce  a  totally  different  set  of  gaps. 
Bearing  in  mind  this  variability,  we  found  empirically  that 
both  the  gap  average  and  standard  deviation  can  be 
approximated  as  multiples  of  the  diameter  standard 
deviation (Figures 3 and 4):

$\bar{G} \approx 1.3443\, S_D$   (2)

$S_G \approx 1.1277\, S_D$   (3)

Figure 3. Average Gap = 1.3443 * Diameter Standard Deviation

Models  for  Center-to-Center  Distance  Mean  and 
Variance. 

Using  information  about  the  average  frequency  with 
which  gaps  occur  and  their  size  distribution,  functional 
relationships  between  the  diameter  mean  and  standard 
deviation  and  the  array  center-to-center  distance  mean  and 
standard  deviation  can  be  found. 

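A full model for the average center-to-center distance,
C, for a row of K balls, is:

$C = \frac{1}{K-1} \sum_{i=1}^{K-1} \left[ \frac{D_i + D_{i+1}}{2} + Z_i G_i - 0.001\,(1 - Z_i)\,\frac{D_i + D_{i+1}}{2} \right]$   (4)

where:

$D_i \sim N(\mu_D, \sigma^2)$      the circle diameters
$G_i \sim \Gamma(\alpha, \beta)$   the gaps ($\mu_G = \alpha\beta$, $\sigma_G^2 = \alpha\beta^2$)
$Z_i \sim B(1, p)$                 a binomial random variable
                                   denoting whether or not a gap
                                   occurs between circles i and i+1.
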

This leads to a very messy computation for the
variance, especially if all of the possible covariances
between random variables are considered. There is some
correlation between the $D_i$ and $G_i$ and $Z_i$, respectively, but
it is small and will be disregarded. A further simplification
is to ignore the fact that the $Z_i$ are random variables, and
replace the $Z_i$ in the formula by their expectation, p:

$C = \frac{1}{K-1} \sum_{i=1}^{K-1} \left[ \frac{D_i + D_{i+1}}{2} + p\,G_i - 0.001\,(1 - p)\,\frac{D_i + D_{i+1}}{2} \right]$   (5)

This gives

$E[C] = (0.999 + 0.001p)\,\mu_D + p\,\mu_G$   (6)

and

$Var[C] = \frac{1}{K-1} \left\{ \frac{(2K-3)(0.999 + 0.001p)^2}{2(K-1)}\,\sigma^2 + p^2 \sigma_G^2 \right\}$   (7)


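Applying equations (2) and (3) to equations (6) and (7)
yields:

$\hat{C} = (0.999 + 0.001p)\,\bar{D} + 1.3443\,p\,S_D$   (8)

and

$\widehat{Var}(C) = \frac{1}{K-1} \left\{ \left[ \frac{(2K-3)(0.999 + 0.001p)^2}{2(K-1)} + (1.1277p)^2 \right] S_D^2 \right\}$   (9)

We can also estimate C and Var(C) by

$\bar{C} = \frac{1}{M} \sum_{j=1}^{M} C_j$

$S_C^2 = \frac{1}{M-1} \sum_{j=1}^{M} (C_j - \bar{C})^2$

where M rows have been measured and divided by K - 1 to
produce the $C_j$.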
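The two routes to the mean and variance of C, the model expressions (8) and (9) and the sample estimates from measured rows, can be compared side by side. The sketch below is an illustration under the independence simplifications stated in the text; the function names are hypothetical, and the factor 2(K - 1) in the diameter term is the one obtained by summing the variances of the averaged neighboring diameters.

```python
import math

def model_mean_c(d_bar, s_d, p):
    """Eq. (8): expected center-to-center distance."""
    return (0.999 + 0.001 * p) * d_bar + 1.3443 * p * s_d

def model_var_c(s_d, p, k):
    """Eq. (9): model variance of the row-average distance for rows of K circles."""
    shrink = 0.999 + 0.001 * p
    diameter_term = (2 * k - 3) * shrink ** 2 / (2 * (k - 1))
    gap_term = (1.1277 * p) ** 2
    return (diameter_term + gap_term) * s_d ** 2 / (k - 1)

def sample_c_stats(row_lengths, k):
    """Sample mean and standard deviation of C_j = R_j / (K - 1) over M rows."""
    c = [r / (k - 1) for r in row_lengths]
    m = len(c)
    c_bar = sum(c) / m
    s2 = sum((x - c_bar) ** 2 for x in c) / (m - 1)
    return c_bar, math.sqrt(s2)
```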

Predicted  Diameter  Standard  Deviation. 

Of course, the real interest is in finding estimates of $\mu_D$
and $\sigma$ in terms of $\bar{C}$ and $S_C$. Eq. 9 suggests that a model
for $\sigma$ should look like:

$\sigma = a \sqrt{(K-1)\,Var(C)}$

However, we assume that the logarithm of $\sigma$ is more nearly
normally distributed than $\sigma$ itself, so this fit is better done
on the log scale:

$\ln(\sigma) = a \ln\left(\sqrt{(K-1)\,Var(C)}\right) + b$

and experimentation showed that a much better fit is found
when ln(C) is included in the model:

$\ln(\sigma) = a \ln\left(\sqrt{(K-1)\,Var(C)}\right) + b \ln(C) + c$   (10)

Figure 5. Predicting Diameter Standard Deviation


Estimators $S_D$ and $S_C$ from the simulation data were used
for $\sigma$ and $\sqrt{Var(C)}$ when fitting this model. Using all of the
data from the described simulations, the parameter
estimates from this fit are $\hat{a}$ = 0.4282, $\hat{b}$ = 0.5061 and $\hat{c}$ =
-2.3913. To find a 95% prediction interval for $\sigma$ we first
propagate the errors in $\bar{C}$, $S_C$, and the parameter estimates
$\hat{a}$, $\hat{b}$ and $\hat{c}$ to estimate the variance in $\ln(\hat\sigma)$ due to
uncertainty in the model.

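Variance Due to Model =

$MSE\,\left(x_*^T (X^T X)^{-1} x_*\right) + \hat{b}^2\,\frac{S_C^2}{M \bar{C}^2} + \frac{\hat{a}^2}{2(M-1)}$

where X is the design matrix, MSE is the mean square
error from the fit of the linear model, and

$x_* = \left( \ln(\sqrt{K-1}\,S_C),\ \ln(\bar{C}),\ 1 \right)$.

The total variance of $\ln(\hat\sigma)$ is:

$Var(\ln(\hat\sigma)) = MSE\,\left(x_*^T (X^T X)^{-1} x_* + 1\right) + \hat{b}^2\,\frac{S_C^2}{M \bar{C}^2} + \frac{\hat{a}^2}{2(M-1)}$
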


Assuming that enough measurements (M) were made so
that two standard deviations is the appropriate width, the
95% prediction interval for $\ln(\sigma)$ is given by

$\ln(\hat\sigma) \pm 2 \sqrt{Var(\ln(\hat\sigma))}$

and the 95% prediction interval for $\sigma$ is:

$\left( \exp\left(\ln(\hat\sigma) - 2\sqrt{Var(\ln(\hat\sigma))}\right),\ \exp\left(\ln(\hat\sigma) + 2\sqrt{Var(\ln(\hat\sigma))}\right) \right)$

Figure 5 shows the fitted data and prediction limits (all
on original scale) plotted against $\sqrt{K-1}\,S_C$. The plot has
been broken into three sections for the three nominal
values of $\mu_D$ that were used to generate the simulation
data.
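As a check on the fitted log-scale model, Eq. 10 with the reported parameter estimates can be evaluated directly. This is a sketch only: the function names are hypothetical, and the variance of $\ln(\hat\sigma)$ is taken as an input rather than computed, since the design matrix of the fit is not available here.

```python
import math

A_HAT, B_HAT, C_HAT = 0.4282, 0.5061, -2.3913  # estimates reported in the text

def predict_sigma(c_bar, s_c, k):
    """Eq. (10): predicted diameter standard deviation from row statistics."""
    x = math.sqrt(k - 1) * s_c
    return math.exp(A_HAT * math.log(x) + B_HAT * math.log(c_bar) + C_HAT)

def sigma_prediction_interval(sigma_hat, var_log_sigma):
    """Back-transform ln(sigma_hat) +/- 2 standard deviations to the original scale."""
    half_width = 2.0 * math.sqrt(var_log_sigma)
    return (sigma_hat * math.exp(-half_width), sigma_hat * math.exp(half_width))

# Array 2 of the empirical results: mean C = 2.9812, S_C = 0.0058, K = 11
print(round(predict_sigma(2.9812, 0.0058, 11), 3))  # 0.029
```

Because the interval is symmetric on the log scale, it is asymmetric about the prediction on the original scale, which matches the shape of the limits reported for the real arrays.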


Predicted Diameter Mean.

The estimator for $\hat{C}$ suggests a fit for $\mu_D$. Equating
Eq. 8 to $\bar{C}$ gives a function for $\bar{D}$ in terms of $\bar{C}$, $S_D$, and p:

$\bar{D} = \frac{\bar{C}}{0.999 + 0.001p} - \frac{1.3443\,p\,S_D}{0.999 + 0.001p}$

Applying the fitted Eq. 10 gives an estimator for $\mu_D$ in
terms of $\bar{C}$, p, and $\sqrt{K-1}\,S_C$:

$\hat\mu_D = \frac{\bar{C}}{0.999 + 0.001p} - \frac{1.3443\,p\,e^{c}\,\bar{C}^{\,b}\,(\sqrt{K-1}\,S_C)^{a}}{0.999 + 0.001p}$   (11)

When a function of this form was fit to the simulation
data directly, it was found that the power to which C was
raised went to 0. Thus, a function of the form:

$\hat\mu_D = \frac{a'\,\bar{C}}{0.999 + 0.001p} - \frac{b'\,p\,(\sqrt{K-1}\,S_C)^{c'}}{0.999 + 0.001p}$

is reasonable. However, the residual sum of squares was
virtually the same for this fit as for the simpler linear model

$\hat\mu_D = \frac{(1 - a'p)\,\bar{C}}{0.999 + 0.001p} - \frac{b'\,p\,\sqrt{K-1}\,S_C}{0.999 + 0.001p}$   (12)

so the latter model was applied.
Unfortunately, p cannot be estimated from
photomicrographs, so its lower bound, determined by the
number of circles in each row measured (Eq. 1), is used.
The parameter estimates from fitting Eq. 12 with the
simulation data are $\hat{a}'$ = 0.0091 and $\hat{b}'$ = 0.9736. To find a
95% prediction interval for $\mu_D$ we again propagate the
errors in $\bar{C}$, $S_C$, and the parameter estimates $\hat{a}'$ and $\hat{b}'$ to
estimate the variance in $\hat\mu_D$ due to uncertainty in the
model:

Variance Due to Model =

$\frac{MSE\,\left(x_*^T (X^T X)^{-1} x_*\right) + \frac{S_C^2}{M}\,(1 - \hat{a}'p)^2 + \frac{\hat{b}'^2 p^2 (K-1) S_C^2}{2(M-1)}}{(0.999 + 0.001p)^2}$

where X is the design matrix, MSE is the mean square
error from the fit of the linear model, and

$x_* = \left( \bar{C},\ \sqrt{K-1}\,S_C \right)$.

Thus, the total variance, V, for $\hat\mu_D$ is:

$V = \frac{MSE\,\left(x_*^T (X^T X)^{-1} x_* + 1\right) + \frac{S_C^2}{M}\,(1 - \hat{a}'p)^2 + \frac{\hat{b}'^2 p^2 (K-1) S_C^2}{2(M-1)}}{(0.999 + 0.001p)^2}$

Assuming that enough measurements (M) were made so
that two standard deviations is the appropriate width, the
95% prediction interval for $\mu_D$ is given by

$\hat\mu_D \pm 2\sqrt{V}$

Figure 6 shows the fitted data and prediction limits
plotted against $\bar{C}$. It is broken into three sections for the
three nominal values of $\mu_D$ that were used to generate the
simulation data. A curious artifact of the data, which has
not been explained to date, is that the slope of the
regression line for the entire data set is different from the
slope for any of the three subsets.

Figure 6. Predicting Mean Diameter


Empirical  Results. 

The estimates for $\mu_D$ and $\sigma$ were tested on a small
amount of real data from arrays of (nominally) 3 micron
polystyrene microspheres. The diameters of these
microspheres have a certified mean of 2.978 microns and a
certified standard deviation of 0.025 microns. The results
from three packed arrays, after corrections were made for
random and systematic error due to the photographic and
measurement processes, are (in microns):

           $\bar{C}$     $S_C$     K    $\hat\sigma$    $\hat\mu_D$
Array 1    2.9826   0.0068   14   0.033   2.971
Array 2    2.9812   0.0058   11   0.029   2.972
Array 3    2.9835   0.0096   17   0.040   2.967

           95% Prediction        95% Prediction
           Limits for $\sigma$        Limits for $\mu_D$
Array 1    0.023   0.047         2.962   2.980
Array 2    0.020   0.042         2.963   2.980
Array 3    0.027   0.057         2.958   2.977

The predictions $\hat\mu_D$ are consistently lower than the
certified value, suggesting that the simulation is not
packing the circles as tightly as the spheres are packed in
reality. However, the 95% prediction intervals do cover or
nearly cover the certified value. Also, the prediction
intervals for $\sigma$ are narrow enough that they will be useful.
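The mean-diameter predictions can be reproduced from the row statistics alone. The sketch below assumes the linear model takes the form $\hat\mu_D = [(1 - a'p)\bar{C} - b'p\sqrt{K-1}\,S_C] / (0.999 + 0.001p)$, with p set to its conjectured lower bound $(K-2)^2 / ((K-1)(3K-1))$ when no other value is supplied; the function names are hypothetical.

```python
import math

A_PRIME, B_PRIME = 0.0091, 0.9736  # estimates reported in the text

def min_gap_proportion(k):
    """Conjectured lower bound for the gap proportion p in rows of K circles."""
    return (k - 2) ** 2 / ((k - 1) * (3 * k - 1))

def predict_mean_diameter(c_bar, s_c, k, p=None):
    """Linear model for the mean diameter; p defaults to its lower bound."""
    if p is None:
        p = min_gap_proportion(k)
    shrink = 0.999 + 0.001 * p
    numerator = (1 - A_PRIME * p) * c_bar - B_PRIME * p * math.sqrt(k - 1) * s_c
    return numerator / shrink

# Arrays 2 and 3 from the table above
print(round(predict_mean_diameter(2.9812, 0.0058, 11), 3))  # 2.972
print(round(predict_mean_diameter(2.9835, 0.0096, 17), 3))  # 2.967
```

Both values agree with the tabulated predictions to the three decimals reported.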

Previous Study of Hexagonal Arrays.

The idea of measuring row lengths in hexagonal arrays
to glean information about the mean diameter is not new.
The relative size of the bias introduced by air gaps was
studied empirically by Kubitschek in 1961 [2], using arrays
of 100 washers. Results here suggest that his estimate of
bias is too large for this application, as will be shown
below.

Kubitschek found that

('  I)  o  tf.  .S|)  (111 

(Note that $S_D$ is not available from the center distance
finding technique, which is why the estimators above did
not use it.) To account for the flattening that takes place
when two polystyrene microspheres touch, $\bar{C}$ is divided by
(0.999 + 0.001p) for the present application, giving:

C  ^  „ 

(0.999  +  0.00 Ip)  " 

From the model

$\bar{D} = \frac{\bar{C}}{0.999 + 0.001p} - \frac{p\,\bar{G}}{0.999 + 0.001p}$

and Eq. 2 we get

$\bar{D} = \frac{\bar{C}}{0.999 + 0.001p} - \frac{1.3443\,p\,S_D}{0.999 + 0.001p}$   (14)

This  suggests  doing  a  direct  fit  of 

C  „ 

0.999  +  O.OOlp  “  ®  (■5) 

using the lower bound for p ($\hat{a}$ = 1.4889). Both Eq. 14 and
Eq. 15 give much smaller estimates of the coefficient of $S_D$
than Kubitschek gave:

Coefficient of $S_D$


Model  (Eq  14) 
Direct  Fit  (Eq.  15) 


Perhaps  Kubitschek's  overestimate  can  be  explained  by 
the  fact  that  the  majority  of  his  data  is  from  washer 
distributions with just two or three distinct sizes. It is
interesting  to  note  that,  applying  Kubitschek's  equation  to 
the  empirical  data,  we  would  get  more  dramatic 
underestimates  of  the  certified  value  than  we  do  using  the 
results  from  the  simulation. 


Conclusion. 

Standard  Reference  Materials  of  polystyrene 
microspheres are used for calibrating optical microscopes.
Packing the spheres into hexagonal arrays instead of
forming chains with them on a microscope slide gives a
relatively quick measurement technique for doing this.
However,  the  raw  estimates  of  diameter  mean  and 
standard  deviation  are  biased  because  of  air  gaps  between 
some  pairs  of  spheres.  So  far,  simulation  has  proven  to  be 
the  only  way  to  examine  this  bias  and  develop  a  correction 
for  it. 
Maximum Queue Size and Hashing with Lazy Deletion


Claire M. Mathieu, Princeton University
Jeffrey Scott Vitter, Brown University


1.  Introduction 

Queueing phenomena are widespread in the fields of operating
systems, distributed systems, and performance evaluation.
Queues are also a natural way to model the size of
classical dynamic data structures, such as buffers, dictionaries,
sets, stacks, queues, priority queues, and sweepline
structures. As a consequence, many statistical properties
of queues have been investigated, such as their expected
size and variance. Yet, very little was known about the
maximum size of queues over a given period of time. If the
size of the queue represents the amount of resource used by
a computer program or a systems component, then such
information is important for making intelligent decisions
about preallocating resources.

Another motivation for our study was the need to
develop and analyze practical space-efficient methods for
processing sweepline information. Some work in this area
has been done by [Vitter and Van Wyk, 1986], [Morrison,
Shepp, and Van Wyk, 1987], and [Ottmann and
Wood, 1986], but as the latter point out, "Surprisingly,
there has been little theoretical investigation of
space-economical plane-sweep algorithms even though such
algorithms have significant practical applications." [Ottmann
and Wood, 1986] do not investigate the maximum number
of items cut by the sweepline; they express the running
times of their algorithms in terms of the maximum
number. Our approach in this paper is to examine the
distribution of the maximum number of cut items, based
on several popular input models, and in addition show
that the "hashing with lazy deletion" (HwLD) algorithm
introduced in [Van Wyk and Vitter, 1986] is extremely
practical and optimum in both average running time and
preallocated space.

We develop new methods and obtain several results
about the distribution of the maximum queue size, under
several models of growth. We study stationary
birth-and-death processes, and are particularly interested in
M/M/$\infty$ and the more general M/G/$\infty$ queues, which
model the amount of plane-sweep information as a function
of time. We also concentrate on HwLD, which is
a non-Markovian queueing model corresponding to the
space usage of the algorithm by the same name. In addition
we study a non-stationary model corresponding to
histories of priority queues.

Plane-sweep algorithms process a sequence of items
over time; at time t the data structure stores the items
that are "living" at time t. Let us think of the ith item
as being an interval $[s_i, t_i]$ in the unit interval, containing
a unique key of supplementary information. The
ith item is "born" at time $s_i$, "dies" at time $t_i$, and is
"living" at time t when $t \in [s_i, t_i]$. The data structure
must be able to support the dynamic operation of searching
the living items based on key value. It is natural to
think of the data structure as a queue, as far as size is
concerned. Let us denote the queue size at time t by Need(t),
the number of items that need to be included in the data
structure. If we think of the items as horizontal intervals,
then Need(t) is just the number of intervals "cut" by
the vertical sweepline at position t. In a typical application,
we may have $10^6$ intervals in the time range [0, 1],
with $E(Need) \approx 10^3$; that is, only square root of the total
number of items tend to be present at any given time
[Szymanski and Van Wyk, 1983]. It is thus very inefficient to
devote a separate storage location to every item; the data
structure should be dynamic.

In HwLD, items are stored in a hash table of H buckets,
based upon the hash value of the key. The distinguishing
feature of HwLD is that an item is not deleted as
soon as it dies; the "lazy deletion" strategy deletes a dead
item only when a later insertion accesses the same bucket.
The number H of buckets is chosen so that the expected
number of items per bucket is small. HwLD is thus more
time-efficient than doing "vigilant" deletion, at a cost of
storing some dead items.
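The lazy deletion strategy can be illustrated with a small sketch. It is not the implementation analyzed in the cited work: the class name `LazyHashTable`, its methods, and the use of Python's built-in `hash` are assumptions made for illustration, and items are assumed to be inserted in order of birth time.

```python
class LazyHashTable:
    """Hash table that removes dead items only when their bucket is next inserted into."""

    def __init__(self, num_buckets):
        self.buckets = [[] for _ in range(num_buckets)]

    def insert(self, key, birth, death):
        """Insert an item at time `birth`, purging dead items from that bucket only."""
        bucket = self.buckets[hash(key) % len(self.buckets)]
        bucket[:] = [item for item in bucket if item[2] >= birth]
        bucket.append((key, birth, death))

    def search(self, key, now):
        """Return the living item with this key at time `now`, or None."""
        bucket = self.buckets[hash(key) % len(self.buckets)]
        for k, birth, death in bucket:
            if k == key and birth <= now <= death:
                return (k, birth, death)
        return None

    def use(self):
        """Number of items stored, living or dead (Use in the text)."""
        return sum(len(b) for b in self.buckets)

    def need(self, now):
        """Number of stored items that are living at time `now` (Need in the text)."""
        return sum(1 for b in self.buckets for (_, s, t) in b if s <= now <= t)
```

The difference `use() - need(now)` is the number of dead items still held, which is the space traded for not having to track deaths as they happen.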

Let Use(t) be the number of items in the HwLD data
structure at time t. It is shown in [Van Wyk and Vitter,
1986] for the M/M/$\infty$ model that at any given time t
we have

$E(Use) = E(Need) + H = \lambda/\mu + H$,

where $\lambda$ is the birth rate of the intervals and $1/\mu$ is the
average lifetime per item. The amount of wasted space
is equal to the number H of buckets. A possible choice
of H is H = E(Need), so that the expected amounts
of space and time used by HwLD are optimal up to
a constant factor. (In practice, the computer memory
space used by HwLD is often less than the space used by
vigilant-deletion strategies, because the latter are typically
based on balanced trees and priority queues which
require more storage overhead (pointer information) per
item.) It was conjectured in [Van Wyk and Vitter, 1986]
that

E{  max  {  S'  •  ih  *  '  \  ' 

If  /•  I 

1. 1  max  (  f  *'<  1 1 '  I  max  {  A  f  r  f/(  t  1 1  ) 

<e  f-  I  If  f.  1 

which  would  prove  that  HwLD  is  als4i  optnual  in  onn 

»»!  />/eaiioi-afed  -tojage  \  s\'-teiii  .tt  iqinUii.n-  1''i  'I.- 

dl'tJtbutKUl  of  {  S  •  >  (i  ^  f  and  foi  M;'  .  if  u .  n*  !  a  ’ ' 


Ol  // 
H = 1 distribution of $\max_t\{Use(t)\}$ in equilibrium for
the M/M/$\infty$ model was recently developed in [Morrison,
Shepp, and Van Wyk, 1987]. They can be used to
get numerical data. Both distributions are nearly identical,
because when H = 1 we have $\max_{t \ge t'}\{Need(t)\} =
\max_{t \ge t'}\{Use(t)\}$, where t' is the birthtime of the first
item to enter the queue after time t = 0.

In this paper we attain an array of results about the
maximum queue size using two independent approaches.
(Due to space limitations, details are deferred to the
full paper.) In the first approach, described in the next
section, we develop several formulas for the distribution
of $\max_t\{Need(t)\}$ for general birth-and-death processes
(which includes the M/M/$\infty$ process) and for the
distribution of $\max_t\{Use(t)\}$ in the general $H \ge 1$ case of
HwLD. We also handle a non-stationary model described
in [Vitter and Van Wyk, 1986]. The formulas provide
exact numerical data on the distributions, and in some cases
lead to asymptotics as the time interval grows. There is a
common underlying structure in the formulas for the
different models: the transform of interest in each case is the
ratio of consecutive classical orthogonal polynomials.

In our second approach, described in Section 3, we
prove the above conjectures for the general M/G/$\infty$
model, which includes M/M/$\infty$ as a special case. We
obtain optimal big-oh bounds on the expected maximum
queue size by using non-queueing theory techniques.
We approximate the continuous time processes
$\max_t\{Need(t)\}$ and $\max_t\{Use(t)\}$ by sums of discrete
quantities related to hashing, specifically, maximum slot
occupancies. (The hashing in our approximation scheme
has nothing to do with the hashing inherent in HwLD.)
Our techniques also seem applicable to other queueing
models, such as M/M/1.

2.  Formulas  for  Maximum  Queue  Size 

It is convenient to extend the range of time to [0, T] for
arbitrary T; the results can be translated back to T = 1
later. In the following sections we derive exact formulas
for the distribution of the maximum queue size in several
models. Our formulas are amenable to numerical calculation
and yield asymptotic expressions in some cases.

The problem has been studied previously in [Morrison,
Shepp, and Van Wyk, 1987] for the special cases of
M/M/$\infty$ and the H = 1 case of HwLD. However, analysis
for the case H = 1 cannot be used to get a good bound for
when H > 1; a corollary of our analysis in Section 3 is that
$H \max_t\{Use_1(t)\}$ is typically greater than $\max_t\{Use(t)\}$
by more than a constant factor, where $Use_1(t)$ is the
occupancy of bucket 1 at time t.

A birth-and-death process is a Markov process in
which transitions from level k are allowed only to levels
k - 1 and k + 1. We shall restrict ourselves to continuous
time in the exposition. Borrowing notation from HwLD,
we define Need(t) to be the level of the process at time t.
The infinitesimal birth and death rates at level k are
denoted $\lambda_k$ and $\mu_k$.

For the special case of the M/M/$\infty$ model we write
$\lambda_0 = \lambda_1 = \cdots = \lambda$ and $\mu_k = k\mu$. For the M/M/1 model,
we write $\lambda_0 = \lambda_1 = \cdots = \lambda$ and $\mu_1 = \mu_2 = \cdots = \mu$.
In both cases, the arrival process is Poisson, and for the
M/M/$\infty$ case the lifespans are exponentially distributed.
The reader can consult [Kleinrock, 1975] for further
background.
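For intuition about the quantity being studied, the maximum level of an M/M/$\infty$ process over [0, T] can be estimated by direct simulation. The sketch below is a plain event-by-event simulation, not one of the exact formulas derived in this paper; the function name and its parameters are hypothetical, and the starting level is supplied by the caller rather than drawn from the equilibrium distribution.

```python
import random

def simulate_max_need(lam, mu, horizon, initial=0, rng=None):
    """Simulate an M/M/infinity queue on [0, horizon] and return its maximum level.

    lam: birth rate (must be positive); mu: death rate per living item.
    """
    rng = rng or random.Random()
    t, level, peak = 0.0, initial, initial
    while True:
        rate = lam + level * mu        # total transition rate at this level
        t += rng.expovariate(rate)     # exponential holding time
        if t > horizon:
            return peak
        if rng.random() < lam / rate:  # next event is a birth
            level += 1
            peak = max(peak, level)
        else:                          # next event is a death
            level -= 1

# Usage: average the maximum over repeated runs with lam/mu = 100 on [0, 1]
runs = [simulate_max_need(100.0, 1.0, 1.0, initial=100, rng=random.Random(s))
        for s in range(20)]
average_max = sum(runs) / len(runs)
```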

In Sections 2.1-2.5, we derive exact formulas for the
maximum queue size using a variety of algebraic and
analytical techniques. The first three sections handle the case
of general homogeneous and stationary birth-and-death
processes in equilibrium at t = 0, the fourth discusses
HwLD under the M/M/$\infty$ model in equilibrium at t = 0,
and the last deals with a non-stationary model.

2.1.  Applications  of  Stack  Histories 

A Dyck path is a walk in $Z^2$ above the x-axis such that
each step is of the type $(a, b) \to (a + 1, b \pm 1)$. Its level
is the maximal y-coordinate reached. Dyck paths are a
special case of file histories: they correspond to histories
of stacks [Flajolet, Françon, and Vuillemin, 1980]. (File
histories will be discussed further in Section 2.3.) Let $\omega$
be a Dyck path going from level i to level j in n steps,
and with height constrained to be $\le k$. For each such $\omega$
we define $p_\omega(T)$ to be the probability that in time interval
[0, T] the successive different states of the process Need(t)
correspond exactly to $\omega$, given that Need(0) = i.
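The set of height-constrained paths between two levels is finite for each n and can be counted by dynamic programming. The sketch below counts walks with steps of +1 or -1 that stay between 0 and k; the function name `count_bounded_paths` is hypothetical, and the code only counts paths, it does not compute the probabilities attached to them.

```python
def count_bounded_paths(i, j, n, k):
    """Number of +1/-1 walks from level i to level j in n steps staying within 0..k."""
    if not (0 <= i <= k and 0 <= j <= k):
        return 0
    counts = [0] * (k + 1)
    counts[i] = 1
    for _ in range(n):
        nxt = [0] * (k + 1)
        for level, c in enumerate(counts):
            if c:
                if level + 1 <= k:
                    nxt[level + 1] += c   # step up, if it respects the height bound
                if level - 1 >= 0:
                    nxt[level - 1] += c   # step down, if it stays above the axis
        counts = nxt
    return counts[j]

print(count_bounded_paths(0, 0, 6, 6))  # 5 (all Dyck paths of length 6)
print(count_bounded_paths(0, 0, 6, 2))  # 4 (the path reaching height 3 is excluded)
```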

Lemma 2.1. We have

$\Pr\{\max_{0 \le t \le T}\{Need(t)\} \le k\}
= \sum_{0 \le i,j \le k} \sum_{\omega} \Pr\{Need(0) = i\} \cdot p_\omega(T)$.

As an example of our method, let us consider the
M/M/1 model with parameters $\lambda$ and $\mu$. The equilibrium
probabilities are given by $\Pr\{Need = i\} = (\lambda/\mu)^i (1 - \lambda/\mu)$.
It remains to calculate the $p_\omega(T)$ terms, which can be
expressed as a multiple integral. In fact, $p_\omega(T)$ does not
depend upon the actual shape of $\omega$, but only upon the
number of times the path hits the x-axis. Using that
gives us $p_\omega(T)$ in simple summation form. Lemma 2.1
can thus be applied to yield an exact expression for
$\Pr\{\max_{0 \le t \le T}\{Need(t)\} \le k\}$.

2.2.  Orthogonal  Polynomials 

We can extend the approach used by Morrison, Shepp,
and Van Wyk for the M/M/∞ model to general
birth-and-death processes. We have

[formula illegible in source]

where $\sigma_{j,k}(t)$ is the density of the time of first passage to
level k starting from level j.

These densities are solutions of a system of
differential equations; taking Laplace transforms, we
get another system, from which we find that $\sigma^*_{j,k}(s)$
is a rational fraction; its poles are roots of
$\omega_k$, and yield $\sigma_{j,k}(t)$ and thus $\Pr\{\max_{0 \le t \le T} \mathit{Need}(t) < k\}$.
Moreover, when Need(t) is a birth-and-death process,
computing the roots of $\omega_k$ is an easier task because $\{\omega_j\}$
is a family of orthogonal polynomials, and when T goes to
infinity, $\Pr\{\max_{0 \le t \le T} \mathit{Need}(t) < k\} \sim K e^{\alpha T}$, with K a
constant and $\alpha$ a root of $\omega_k$ with maximal modulus.

Karlin and McGregor (1958) introduce the family of
polynomials $\{Q_n(x)\}$ with the properties that $Q_0(x) = 1$
and $-xQ = AQ$, where A is the infinitesimal generator
matrix defined so that $A_{k,j}$ is equal to $\lambda_k$ if j = k + 1,
$-(\lambda_k + \mu_k)$ if j = k, $\mu_k$ if j = k - 1, and 0 otherwise. It
turns out that $Q_n(x) = \omega_n(-x)$. This expression gives
an extremely simple tool for linking birth-and-death
processes to classical families of orthogonal polynomials:

Theorem 2.1. For the M/M/1 process, we have

[formula illegible in source]

where $\rho = \lambda/\mu$ [remaining definitions illegible in source] and $\{T_n\}$ is
the family of Chebyshev orthogonal polynomials. For the
M/M/∞ process, we have

[formula illegible in source]

where $\{C_n\}$ is the family of Poisson-Charlier orthogonal
polynomials. For several types of linear birth-and-death
processes, of the form $\lambda_k = ak + b$, $\mu_k = k + c$,
$Q_n(x)$ can be expressed in terms of either Laguerre
polynomials or Meixner polynomials of the second kind.

General birth-and-death processes can also be
related to orthogonal polynomials, using the framework
of file histories discussed in [Flajolet, Françon, and
Vuillemin, 1980].

2.3. Continued Fractions

File histories model the evolution of several classical
types of dynamic data structures: stacks (S), priority
queues (PQ), linear lists (LL), symbol tables (ST), and
dictionaries (D). The data structures are treated as
combinatorial objects; their performance characteristics are
determined by the relative order of the elements they
contain, not by the actual values of the elements. Thus, we
say that there are k + 1 ways of inserting a new element into
a dictionary of size k, since there are k + 1 "gaps" where the
new element can fit in, relative to the k elements already
present. The evolution of the data structure is represented
as a path in $\mathbb{Z}^2$ (the x-coordinate counts the number of
operations, whether they be insertions, deletions or queries,
and the y-coordinate counts the size), where each step is
of the type $(a, b) \to (a + 1, b \pm 1)$ (insertion or deletion)
or $(a, b) \to (a + 1, b)$ (positive or negative query). To
each step we associate a certain choice among the
possibilities, each equally likely. For example, in priority queue
files deletions can be performed only for the minimum
element, so the number of possibilities for a deletion is 1.
For purposes of brevity, let us restrict ourselves to the
M/M/∞ model [condition illegible in source]. This process is related


to histories of symbol tables, in which the number of
possibilities for insertion, deletion, and query at level k are
equal to k + 1, 1, and k, respectively (Flajolet, Françon,
and Vuillemin, 1980). We let $H_{j,k}(z)$ be the ordinary
generating function of the number of symbol table histories
going from level j to k, and we define $H^{[h]}_{j,k}(z)$ similarly
except with the histories constrained to have height $\le h$.

Let us consider the bounded process $\lambda_0 = \lambda_1 = \cdots =
\lambda_{h-1} = \lambda$, $\lambda_h = 0$, $\mu_k = k\mu$, whose height can never exceed
level h (this process can be denoted M/M/∞/h). We
define $\sigma_{j,k}(t)$ to be the associated density function for the
first passage time to level k. If we call $\sigma^*_{j,k}(s)$ the Laplace
transform of $\sigma_{j,k}(t)$, then $\sigma^*_{j,k}(s)$ is the solution of the
system

[system of equations illegible in source]

for j < h, which can be put naturally into the form of a
continued fraction:

[continued fraction illegible in source]

All file histories seen so far have their height bounded
above or below by some constant. This is due to our
concentrating on times of first passage through a state i,
which implies that level i must be a barrier for the
histories; they must not be allowed to go through state i.
But if we now remove the constraint of first passage and
consider $P_{k,\ell}(t) = \Pr\{\mathit{Need}(t) = \ell \mid \mathit{Need}(0) = k\}$, in
the same way we now get the Laplace transform of
$P_{k,\ell}(t)$. Taking the inverse Laplace transform will finally
yield $\sigma_{j,k}(t)$ and $P_{k,\ell}(t)$.
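The combinatorial objects of this section can be enumerated directly. The following Python sketch counts weighted file histories with a simple dynamic program over levels; it is only an illustration, not the generating-function or continued-fraction machinery of the paper. The function name `count_histories` and its parameters are hypothetical, and the possibility counts passed in are the ones quoted in the text (k + 1, 1, and k for symbol tables; stacks correspond to Dyck paths, with one possibility per insertion or deletion and no queries).

```python
from typing import Callable, Optional


def count_histories(
    n: int,
    start: int,
    end: int,
    ins: Callable[[int], int],
    dele: Callable[[int], int],
    qry: Callable[[int], int],
    max_height: Optional[int] = None,
) -> int:
    """Count weighted histories of n steps going from level start to level end.

    ins(k), dele(k), qry(k) are the numbers of possibilities for an insertion,
    a deletion, or a query when the structure holds k elements.
    If max_height is given, the path may never exceed that level.
    """
    top = max_height if max_height is not None else start + n
    if not (0 <= start <= top and 0 <= end <= top):
        return 0
    # ways[k] = number of weighted histories built so far that sit at level k
    ways = [0] * (top + 1)
    ways[start] = 1
    for _ in range(n):
        nxt = [0] * (top + 1)
        for k, w in enumerate(ways):
            if w == 0:
                continue
            if k + 1 <= top:
                nxt[k + 1] += w * ins(k)   # insertion: level k -> k + 1
            if k >= 1:
                nxt[k - 1] += w * dele(k)  # deletion: level k -> k - 1
            nxt[k] += w * qry(k)           # query: level unchanged
        ways = nxt
    return ways[end]


# Stack histories (Dyck paths) of length 6, unrestricted and with height <= 2.
stack = dict(ins=lambda k: 1, dele=lambda k: 1, qry=lambda k: 0)
print(count_histories(6, 0, 0, **stack))                # 5
print(count_histories(6, 0, 0, max_height=2, **stack))  # 4
```

With priority-queue weights (k + 1 possibilities for an insertion, 1 for a deletion, no queries), the same routine gives 1·3·5 = 15 histories of length 6, which matches the product 1·3·…·(2n - 1) used in Section 2.5.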

2.4.  Hashing  with  Lazy  Deletion 

The H = 1 case, in which there is no hashing and a
vigilant deletion strategy is used, was analyzed in [Morrison,
Shepp, and Van Wyk, 1986]. We can extend the analysis
to H > 1 by using the appropriate conditional
probabilities. Let us consider for simplicity
the two-bucket case H = 2. For bucket i, we define
$\mathit{Use}_i(t)$ and $\mathit{Need}_i(t)$ in the obvious way, and
$\mathit{Waste}_i(t) = \mathit{Use}_i(t) - \mathit{Need}_i(t)$. We have

[formula illegible in source]

We can compute the first passage time densities using Laplace
transforms and probability techniques, which allow us to
calculate the distribution and mean of the maximum.
Discussion will be deferred to the full paper.

2.5.  Hermite  Polynomials 

We consider the model introduced in [Van Wyk and Vitter,
1986], in which the birthtimes and
deathtimes of the n items are independent uniform random
variables from the unit interval. The ith item is
born at time $\min\{s_i, t_i\}$ and dies at time $\max\{s_i, t_i\}$.
The average queue size $E(\mathit{Need}(t)) = 2nt(1 - t)$ attains
its maximum n/2 at t = 1/2. The question of interest
is to determine the distribution of the random variable
$\max_{0 \le t \le 1}\{\mathit{Need}(t)\}$. We shall see that it is the same as
the height of a priority queue file history as discussed in
[Flajolet, Françon, and Vuillemin, 1980].

By studying involutions with no fixpoints, we can
show that our problem is equivalent to determining the
distribution of the maximum size of a random priority
queue. We denote by $H^{[h]}_{2n}$ the number of priority
queue histories of length 2n and height $\le h$, and
we let $H^{[h]}(z)$ be its corresponding generating function.
We have

$$\Pr\Big\{\max_{0 \le t \le 1} \mathit{Need}(t) \le h\Big\} = \frac{H^{[h]}_{2n}}{1 \cdot 3 \cdots (2n - 1)}.$$

Flajolet [1981] shows that $H^{[h]}(z)$ = [formula illegible in source],
where the polynomial involved is the (h + 1)st orthogonal
Hermite polynomial, whose roots are real and
distinct. This allows us to get a simple exact expression
for $\Pr\{\max_{0 \le t \le 1} \mathit{Need}(t) \le h\}$ and an asymptotic
approximation as $n \to \infty$.
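The non-stationary model of this section is easy to experiment with numerically. The sketch below, written in Python purely for illustration, samples n items whose birth and death times are the minimum and maximum of two independent uniform draws, and finds the maximum queue size with a sweep over the sorted events. The helper names (`expected_queue_size`, `sample_items`, `max_queue_size`) are hypothetical; the simulation is a sanity check and not the exact Hermite-polynomial analysis described above.

```python
import random


def expected_queue_size(n, t):
    """Average number of items alive at time t; each item is alive with probability 2t(1 - t)."""
    return 2 * n * t * (1 - t)


def sample_items(n, rng):
    """Draw n items; each is born at min(s, t) and dies at max(s, t) for uniform s, t."""
    items = []
    for _ in range(n):
        s, t = rng.random(), rng.random()
        items.append((min(s, t), max(s, t)))
    return items


def max_queue_size(items):
    """Maximum number of items alive simultaneously, found by sweeping over events."""
    events = []
    for birth, death in items:
        events.append((birth, 1))
        events.append((death, -1))
    # At equal times a death (-1) sorts before a birth (+1); with continuous
    # uniform times such ties occur with probability zero.
    events.sort()
    alive = best = 0
    for _, delta in events:
        alive += delta
        best = max(best, alive)
    return best


rng = random.Random(0)
items = sample_items(1000, rng)
print(expected_queue_size(1000, 0.5), max_queue_size(items))
```

The sampled maximum can be compared with the peak n/2 of the average queue size; it cannot fall below the queue size observed at t = 1/2 in the same sample.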

3.  Optimal  Bounds 

In this section we prove for the stationary M/G/∞ model
that the expected maximum storage needed (that is, the
expected maximum M/G/∞ queue size) and the expected
maximum storage used in excess of that amount are within
constant factors, respectively, of the expected storage
needed and wasted at any given time. The birth rate is
a Poisson process with intensity $\lambda$. In the special case
of the M/M/∞ model, the lifespans are given by the
exponential distribution with mean $1/\mu$. In the general
M/G/∞ model, the lifespan distribution is arbitrary, with
mean $1/\mu$. The following two theorems are the main
results of Section 3:

Theorem 3.1. We have

$$E\Big(\max_{0 \le t \le 1} \{\mathit{Need}(t)\}\Big) = O(E(\mathit{Need})) = O\Big(\frac{\lambda}{\mu}\Big),$$

under the condition that $\mu = O(\lambda/\log \lambda)$ in the M/M/∞
case, and $\mu = O(\lambda/\log^2 \lambda)$ in the general M/G/∞ case.

Theorem 3.2. Let $\epsilon > 0$ be any constant. Then if the
number H of buckets in HwLD is $O(\lambda^{1-\epsilon})$, we have

$$E\Big(\max_{0 \le t \le 1} \{\mathit{Use}(t)\} - \max_{0 \le t \le 1} \{\mathit{Need}(t)\}\Big) = O(E(\mathit{Waste})) = O(H).$$

The restrictions on $\mu$ and H in the theorems are extremely
weak; they are typically met in practical
applications, for example [Van Wyk and Vitter, 1986]. In
fact, it can be shown that Theorem 3.1 is not true if $\mu$
is too large; the restriction is thus partly inherent in the
problem. For Theorem 3.2, however, we conjecture that
the restriction $H = O(\lambda^{1-\epsilon})$ can be lifted.

We prove Theorem 3.1 in the next section and
Theorem 3.2 in Section 3.2. Our approach for both is to
approximate the queueing process by a sequence of stages of
a discrete analog, which we call time hashing. The
particular forms of time hashing we use for the two cases are
different. But they share the common property that
the early stages of the time hashing capture most of what
is going on in the queueing process; in the later stages
the number of slots in the hash table becomes smaller and
smaller (and each slot covers a larger span of time) and
the contribution becomes less and less.

3.1. Maximum Size of M/G/∞ Queue

This section is devoted to the proof of Theorem 3.1. The
number H of buckets in the HwLD implementation does
not affect the value of Need in any way, so we shall assume
in this section that H = 1. The distribution of Need(t) is
Poisson with mean $\lambda/\mu$:

Lemma 3.1. For the M/G/∞ model, we have

$$\Pr\{\mathit{Need}(t) = i\} = \frac{(\lambda/\mu)^i \, e^{-\lambda/\mu}}{i!}.$$
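As a quick numerical illustration of the Poisson law for Need(t), the following Python sketch evaluates the probability mass function and checks that it sums to 1 and has mean $\lambda/\mu$. It assumes the standard stationary M/G/∞ fact that the queue size is Poisson with mean $\lambda/\mu$; the function name `need_pmf` and the sample rates are hypothetical choices for the example.

```python
import math


def need_pmf(i, lam, mu):
    """Probability that i items are alive, assuming a Poisson law with mean lam / mu."""
    mean = lam / mu
    return math.exp(-mean) * mean ** i / math.factorial(i)


lam, mu = 6.0, 2.0  # illustrative rates only
probs = [need_pmf(i, lam, mu) for i in range(60)]
print(sum(probs))                                # close to 1
print(sum(i * p for i, p in enumerate(probs)))   # close to lam / mu
```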

The proof of Theorem 3.1 relies on the following
technique we introduce, called time hashing: Let K be an
integer parameter to be specified later. We shall consider
all items that are alive at some time during [0, 1]. Stages
k = 0, 1, 2, ..., K of time hashing are defined as follows:

For $0 \le k \le K$, all items (intervals) that have lifespan
in the range $(\frac{1}{\mu} 2^{k-1}, \frac{1}{\mu} 2^k]$ and that are born in either the
unit interval (0, 1] or one of the end intervals $(-\frac{1}{\mu} 2^k, 0]$ and
$(1, \lceil \mu 2^{-k} \rceil \frac{1}{\mu} 2^k]$ are put into stage k; in addition, for k = 0,
the lifespan requirement is weakened so that the lifespan
must be in the range $[0, \frac{1}{\mu}]$. Each stage consists of a hash
table of $\lceil \mu 2^{-k} \rceil + 1$ slots. The jth slot, for $0 \le j \le \lceil \mu 2^{-k} \rceil$,
represents the interval of time $(\frac{1}{\mu}(j - 1) 2^k, \frac{1}{\mu} j 2^k]$. An item
in stage k is placed into the slot corresponding to its birth
time. We also define a special stage K + 1 as follows: Slot 0
consists of all items born in (0, 1] with lifespan $> \frac{1}{\mu} 2^K$;
the remaining $\lceil \mu 2^{-K} \rceil$ slots are left empty.

We define $N_k(j)$ to be the number of stage k items
in slot j. The following fundamental relation bounds
$\max_{0 \le t \le 1}\{\mathit{Need}(t)\}$ by the sum of the expected maximum
slot occupancies in time hashing.

Lemma 3.2. We have

$$\max_{0 \le t \le 1} \{\mathit{Need}(t)\} \le 2 \sum_{0 \le k \le K+1} \max_{j} \{N_k(j)\}.$$

The M/M/∞ Case. First we shall handle the M/M/∞
case, in which the lifetimes are exponentially distributed
with mean $1/\mu$. The restriction on $\mu$ in Theorem 3.1 is
slightly weaker in this case than in the general M/G/∞
case. In this subsection we assume that we are dealing
with the M/M/∞ model and that $\mu = O(\lambda/\log \lambda)$. We
define the stage parameter K to be $\lceil \lg \ln \mu \rceil$.

Lemma 3.3. The expected number of items in stage
K + 1 is

[formula illegible in source]

Lemma 3.4. For $0 \le k \le K$, let $n_k$ be the expected
number of items in stage k, and let $m_k = \lceil \mu 2^{-k} \rceil + 1$ be
the number of slots in the time hashing of stage k.
Then the number $N_k(j)$ of items in slot j of stage k is
Poisson distributed with mean $\alpha_k = n_k/m_k$, where

[formula illegible in source]

Lemma 3.5. The expected maximum occupancy of the
slots in stage k, $0 \le k \le K$, is

[formula illegible in source]

The proof of Lemma 3.5 makes use of the following
lemma and corollary. They give us an upper bound in
an easy way for the expected maximum slot occupancy in
hashing. The lemma is phrased for general slot occupancies
$X_j$ that are not assumed to be independent; when the
occupancies are independent or satisfy a certain property,
the bound in the corollary is obtained.

Lemma  3.6.  For  rainloni  cana/iies  A’l . V„, .  if 

PrjAy  >  h)  '£  1/lnini.  for  nil  1  <  ,/  £  where 

"  =  J  f  titell  we  iiaie 

A'l  max  {.V.  ()£/»+  -A'(  max  {A,}  I  max  {A.)  >  h). 

Corollary  3.1.  If  in  addifion  to  the  assumption  re</tiire<i 
for  LeinniH  3.0  we  ai.so  have 

E{  max  j  A’j }  !  max  {.V,|  h]  •;  E{  max  (.V^l  |  .Vi  ^  /•). 

\  '^  J''  ni  I  <  /  <  /ft  i 

then 

El  niitx  {-V.  )l  £  /) -t  -  E(  max  { A'^  )  j  A'l  •  It}. 

It  lyi"-',,, 

The rest of the proof of Theorem 3.1 for the M/M/∞
case consists of taking expectations in the expression of
Lemma 3.2 and substituting the bounds from Lemmas 3.3
and 3.5, which gives a convergent geometric series.

The M/G/∞ Case. In this subsection we assume that
$\mu = O(\lambda/\log^2 \lambda)$. For the case of the M/G/∞ model,
the distribution of lifetimes is allowed to be an arbitrary
one with mean $1/\mu$. So in particular the approach we used
above for M/M/∞ (namely, Lemma 3.3) will not work: for
each given value of k, stage k could contribute as much as
$\Theta(\lambda/\mu)$ to $E(\max_{t \in [0,1]}\{\mathit{Need}(t)\})$. Instead we use the
following important correspondence between the average
slot occupancies and E(Need):

Lemma 3.7. Let $\alpha_k = E(N_k(0))$ be the average number
of items in slot 0 of stage k. Then

[formula missing in source]

We use time hashing as before, but with the stage
parameter set to $K = \lceil \lg \mu \rceil$; there are $\lceil \mu 2^{-k} \rceil + 1$
slots in stage k, for each $0 \le k \le K$ [bound on the slot count
illegible in source]. An easy application of Corollary 3.1 gives
us the following key lemma, which is the basis for the proof
of Theorem 3.1 for the M/G/∞ case.


3.2.  Optimal  Bounds  on  Waste  in  HwLD 


To prove Theorem 3.2, we derive an upper bound for
$E(\max_t \{\mathit{Waste}(t)\})$, where Waste(t) = Use(t) - Need(t)
is the number of dead items that are still in the HwLD
data structure at time t. This therefore gives an upper
bound on $E(\max_t \{\mathit{Use}(t)\} - \max_t \{\mathit{Need}(t)\})$. It is
important to note that the former quantity is usually larger
than the latter, because Use(t) and Need(t) typically do
not attain their maxima at the same time t.

To bound the expected maximum waste, we use a
time hashing of a different nature than in Section 3.1. The
stages are numbered k = 0, 1, ..., K + 1, and each of the
H buckets has its own set of stages. The hash table for
each bucket for stage k has [number illegible in source] slots.
The jth slot represents the time interval [formula illegible in
source]. The first half of each slot is called
the death zone, and the second half is called the twilight
zone. For each stage, one entry is put into its jth slot
for every death in the death zone of its jth slot, with the
extra requirement that there are no births in the twilight
zone of the jth slot; if there is a birth in the twilight zone,
no entries are placed into the jth slot.

In addition, stages 0 and K + 1 are supplemented as
follows: In stage 0, an entry is put into the jth slot for
every death in the death zone, regardless of whether there
have been no births in the twilight zone. In stage K + 1
we move all the entries into slot 0 from the other slots.
We let $W_{h,k}(j)$ denote the slot occupancy for the jth
slot in the time hashing table for bucket h in the kth stage.
We define $W_k(j)$ to be the total number of entries in the
jth slots of the hash tables for buckets 1, 2, ..., H:

$$W_k(j) = \sum_{1 \le h \le H} W_{h,k}(j).$$

We set the stage parameter K to be $K = \lceil \lg \ln(\lambda/H) \rceil$.

For completeness, we should mention that there is a
total of four instances of time hashing, not just the one
defined above. The second instance of time hashing is
defined in an identical way, except that the time intervals
of the slots are offset [amount illegible in source] from the
time intervals of the instance defined above. In addition to
these two instances, we consider two "reverse" instances, in
which time is viewed backwards: we start at time t = 1 and
end at time t = 0, and we process each death as a birth
and vice versa. Without loss of generality we shall discuss
only the first instance of time hashing, as defined in the
previous paragraphs, and introduce an extra factor of 4
into our bounds, where appropriate.

A key observation for the derivation is that the death
rate in the M/G/∞ model is a Poisson process with the
same intensity as the birth rate. This follows because the
M/G/∞ model is symmetric and stationary, and thus also
reversible [Kelly, 1979]. The following lemma is the basis
for our proof of Theorem 3.2.


Leniiiia  3.9.  Ue  have 

max  {  Wa.itf  ( 1 1  [  '  3 
10  1-  1 

II  ' 


in;i\ 
'  JJ  ' 


M-n  /.i 




We shall prove Theorem 3.2 by bounding the sum in
Lemma 3.9 by O(E(Waste)) = O(H). A big difference
between this application of time hashing and the ones we
used in Section 3.1 is that the random variables $W_{h,k}(j)$
(and hence also $W_k(j)$) are almost always 0 as k grows.
We have $\Pr\{W_{h,k}(j) = 0\} \ge 1 - [\ldots]$. This causes the
maximum slot occupancy to behave wildly. In fact, to get
our bound, it is not enough to bound $E(\max_j \{W_{h,k}(j)\})$
and then multiply by H, because the result will be too
large: the load factor in the analysis of $\max_j \{W_{h,k}(j)\}$
is too small, and the ratio between the average maximum
slot occupancy and the average slot occupancy is
no longer O(1). The solution is to consider the H buckets
in toto and to bound $E(\max_j \{W_k(j)\})$ directly. We
do that by computing the moment generating function
of $\max_j \{W_k(j)\}$ and then applying Corollary 3.1
using Chernoff's bound.

Lemma 3.10. The expected number of entries in
stage K + 1 is

$$E(W_{K+1}(0)) = E\Big(\max_j \{W_{K+1}(j)\}\Big) = O(H).$$

Lemma 3.11. The expected maximum occupancy of
the slots in stage k, $0 \le k \le K$, is

[formula illegible in source]

Theorem 3.2 follows by combining Lemmas 3.9, 3.11,
and 3.12, and summing on k.

4.  Conclusions 

The maximum size attained by a queue over time is a basic
notion in stochastic processes and queueing theory. In
terms of data structures, if we model the insertions and
deletions of elements as the birth and death of items in
a queue, then the maximum queue size is the maximum
size of the data structure. Our conclusions come in two
forms: First, we have used in a natural way a variety of
algebraic and analytical techniques to obtain exact
formulas for the distribution of the maximum size of queues
for birth and death processes and for hashing with lazy
deletion (HwLD). Our solutions are amenable to numerical
calculation and some asymptotics. The formulas for
several different models are related in that the relevant
transform in each case can be expressed as a ratio of
classical orthogonal polynomials.

Second, we have answered some open questions in
queueing theory using discrete, non-queueing-theory
techniques. We have obtained optimal big-oh bounds on the
expected maximum queue size for the M/G/∞ process
and for HwLD. We prove for HwLD that the expected
maximum amount of needed space (that is, the maximum
size of the M/G/∞ queue, on the average) and the
expected maximum amount of space used by HwLD above
the optimal amount are within small constant factors,
respectively, of the average space needed and wasted at any
given time. Our techniques also appear to be applicable
to the M/M/1 model, which introduces several interesting
new facets to the problem.

Current work is aimed at removing the condition
on H from Theorem 3.2. The proof technique,
though, has to be different, because it is easy to show
for $H = \lambda$ that $\max_{0 \le t \le 1}\{\mathit{Waste}(t)\}$ has unbounded
expectation. Another problem being worked on is to
determine the constant factors inherent in the big-oh bounds.
Preliminary results suggest that the constants in
Theorems 3.1 and 3.2 are asymptotically 1 under general
conditions.


CLASSIFYING LINEAR MIXTURES,

WITH AN APPLICATION TO HIGH RESOLUTION GAS CHROMATOGRAPHY


William  S.  Rayens,  University  of  Kentucky 


1.  INTRODUCTION 

1.1  Overview 

Consider  g  groups,  each  of  which  can  be  character¬ 
ized  in  terms  of  p  particular  variables.  Suppose  a  test 
observation  y  is  a  “linear  mixture”  in  the  sense  that  each 
of  the  p  variables  associated  with  y  can  be  characterized 
as  a  convex  combination  of  the  corresponding  variables  in 
these  component  groups.  The  weights  defining  this  convex 
combination  will  be  called  “mixing  proportions”.  The  test 
observation  is  “classified”  when  the  mixture  constituents 
are  identified  and  the  mixing  proportions  are  estimated. 

In  this  paper  we  propose  a  model  which  seeks  to  clas¬ 
sify  linear  mixtures.  Section  2  contains  the  motivation 
for,  and  an  outline  of  the  model  development.  Section  3 
contains  the  details  and  results  of  the  application  of  the 
model  to  the  problem  of  identifying  the  constituents  in 
polychlorinated biphenyl samples. Finally, section 4 contains
a statement of our conclusions, and section 5 briefly
mentions  the  computer  routine  used  to  implement  the 
methodology. 

1.2  Application  Context 

Polychlorinated biphenyls (PCBs) occurring in the
environment of the United States originate from one or
more of nine industrial products: Aroclors (registered
trademark of the Monsanto Corporation). Each of these
nine can be characterized by a particular set of constituents
and their relative concentrations, which are determined
by gas chromatography.^ These constituents differ
by the location of chlorine atoms along the carbon chain
associated with a biphenyl molecule. Theoretically, there
are 200 distinguishable arrangements; far fewer are generally
available in practice. Further, an environmental or
biological specimen can be characterized as a weighted
average of the constituent concentrations present in the
component Aroclors, in which the weights are the mixing
proportions. That is, the chromatogram associated with a
mixture is essentially a weighted average of the chromatograms
associated with the component Aroclors present in
the mixture.

We had access to a training set, Y, consisting of six
runs for each of the nine pure Aroclors. Unique,
identifiable peaks in the chromatograms led to the
development of 93 concentration variates that correspond
to relative concentrations of individual PCBs or conjoint
PCBs that coelute.^ Hence, Y contains N = 54 observations
(rows) and p = 93 variables (columns). Rows 1-6 of
Y correspond to the runs on Aroclor 1; rows 7-12 are the
runs on Aroclor 2, etc. The aforementioned reference
furnishes the details concerning the chemical and detection
methods that ultimately led to the training set Y.
2.  DEVELOPMENT  OF  THE  MODEL 

2.1  Assumptions 

Suppose a random vector $y \in \mathbb{R}^p$ is p-variable normal,
$N_p(\mu(\gamma), \Sigma)$, where $\mu(\gamma) = \sum_{i=1}^{g} \gamma_i \mu_i$;
$\gamma^t = (\gamma_1, \ldots, \gamma_g)$ is the
vector of mixing proportions, so $\sum_{i=1}^{g} \gamma_i = 1$ and $\gamma_i \ge 0$ for
all i; $\Sigma$ is assumed to be positive definite, and $\mu_i \in \mathbb{R}^p$ for
i = 1, ..., g. Our objective is to estimate $\gamma$. The model
assumptions can be subjected to criticism, but in the final
analysis the assessment of the methodology will be made
not on the validity of the model assumptions, but on how
well the procedures work on real data.

In the PCB context $\gamma_j$ is interpreted as the concentration
of the jth Aroclor in the mixture. Likewise the vector $\mu_j$
represents the pure chromatogram corresponding to the
jth Aroclor and $\Sigma$ is the covariance matrix of the
chromatograms. In developing our model we will first consider
the covariance matrix and the pure chromatograms to be
known.

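2.2 Classification with $\Sigma$ and $\mu_1, \ldots, \mu_g$ known.

When $\Sigma$ and $\mu_1, \ldots, \mu_g$ are known, we can use
maximum likelihood to estimate $\gamma$. The likelihood function
associated with $\mu(\gamma)$ is:

$$L(\mu(\gamma)) = \frac{1}{(2\pi)^{p/2} \det(\Sigma)^{1/2}} \exp\Big[-\tfrac{1}{2}\,(y - \mu(\gamma))^t \Sigma^{-1} (y - \mu(\gamma))\Big].$$

As a function of $\gamma$ the maximum of this expression is
achieved where $(y - \mu(\gamma))^t \Sigma^{-1} (y - \mu(\gamma))$ (the Mahalanobis
distance from y to $\mu(\gamma)$) is minimized. Define

$$Q = \Big\{\gamma \in \mathbb{R}^g : \gamma^t = (\gamma_1, \ldots, \gamma_g),\ \sum_{i=1}^{g} \gamma_i = 1,\ \gamma_i \ge 0\ \forall i\Big\}.$$

The restrictions $\mu(\gamma) = \sum_{i=1}^{g} \gamma_i \mu_i$ and $\gamma \in Q$ constrain $\mu(\gamma)$ to
a simplex having $\mu_1, \ldots, \mu_g$ as vertices. Hence, a maximum
likelihood estimate of $\gamma$, say $\hat{\gamma}$, is found by locating
the point on this simplex, $\mu(\hat{\gamma})$, that is closest to y in the
sense of Mahalanobis distance in $\mathbb{R}^p$.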

Suppose $L = \mathrm{span}\{\Sigma^{-1/2}(\mu_i - \mu_1),\ i = 2, \ldots, g\}$ and B = any
matrix whose columns form an orthonormal basis
for L. The following theorem shows that the transformation
$B^t \Sigma^{-1/2} : \mathbb{R}^p \to \mathbb{R}^{g-1}$ reduces the problem of finding
$\hat{\gamma}$ in terms of Mahalanobis distance in $\mathbb{R}^p$ to an identical
problem involving Euclidean distance in $\mathbb{R}^{g-1}$.

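Notation:

SP = (g - 1)-simplex having vertex set [illegible in source]

z = [definition illegible in source]

$\hat{z}$ = point on SP closest to z

$\theta$ = [illegible in source], the barycentric
coordinates of $\hat{z}$ relative to SP.

Note: [formula illegible in source]

Hence, the (g - 1) components of z represent the
coordinates (with respect to the basis B) of the
orthogonal projection of [illegible in source] onto L.

Theorem 1:

$\theta$ is a maximum likelihood estimate of $\gamma$,
subject to $\gamma \in Q$. That is, $\theta = \hat{\gamma}$.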

Proof:  We  need  to  introduce  some  more  notation  and 
make  two  observations. 

Additional  Notation: 

«  t 

c-l  1-2 

c,  =  B‘,e, 

»•  =  i;jj 

lb’ll  =  length  of  the  vector  c  ,  in  terms  of 
the  usual  Euclidean  inner  product 

;Vo/c  /:  w*  =  tCj 

1-2 


It follows that $B(B^t B)^{-1} B^t (w^*) = BB^t (w^*)$ = "orthogonal
projection of $w^*$ onto L" = $w^*$.

Note 2: If $w^* \in L$ and $w \in \mathbb{R}^p$, then

$$\|w - w^*\|^2 = \|w - BB^t w\|^2 + \|BB^t w - w^*\|^2.$$

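A maximum likelihood estimator is calculated as follows:

$$
\begin{aligned}
\min_{\gamma \in Q}\,[(y - y_m)^t \Sigma^{-1} (y - y_m)]
&= \min_{\gamma \in Q}\,[(y - y_m)^t \Sigma^{-1/2} \Sigma^{-1/2} (y - y_m)] \\
&= \min_{\gamma \in Q}\,[(w + w_1) - (w^* + w_1)]^t\,[(w + w_1) - (w^* + w_1)] \\
&= \min_{\gamma \in Q}\,[(w - w^*)^t (w - w^*)] \\
&= \min_{\gamma \in Q}\,\|w - w^*\|^2 \\
&= \min_{\gamma \in Q}\,[\|w - BB^t w\|^2 + \|BB^t w - w^*\|^2] \quad \text{(Note 2)}.
\end{aligned}
$$

Since $\|w - BB^t w\|^2$ is independent of $\gamma$, the calculation of a
maximum likelihood estimator reduces to finding: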

min  HBB'  •('— w*ll' 

-rf  Q 

=  minllBB'tt'-BB'tc'ir-  (IS'ote  1) 

ie« 

=  min  IIB(,'-r|)-B(c,,^-C|)ll'- 

=  min  lIBc— 

'T  41 

=  min(Bj-Bc,„)'(Bc-B2,„) 

TV 

=  mtn{z-z,J‘B‘B{z-:,J 

T  Q 

=  min 

T  i;i 

=  min  lie— , 

T  Q 

Hence,  must  be  chosen  to  be  the  point  on  SP  that  is 
chmest  to  That  is,  ~z,  and  7  as  claimed  3 

This result makes it clear how $\gamma$ can be estimated,
and, hence, what our model will be (when all the parameters
$\mu_i$ and $\Sigma$ are known). First, we form the simplex SP.
Next, given a p×1 vector y as the test observation, we
calculate the (g - 1)×1 "score" z and identify z
with the closest point on SP, say $\hat{z}$. Finally we calculate
the barycentric coordinates of $\hat{z}$ with respect to SP and
use these as estimates of the unknown mixing proportions.
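The estimation step can be illustrated numerically. The Python sketch below finds the mixing proportions by minimizing the squared Euclidean distance between y and a convex combination of the vertices. Two points are assumptions of the example rather than the procedure of the paper: the data are taken to be already transformed so that Mahalanobis distance reduces to Euclidean distance (that is, $\Sigma$ is treated as the identity), and the closest point is located by projected gradient descent, a generic numerical substitute for the geometric projection onto SP. The function names are hypothetical.

```python
def project_to_simplex(v):
    """Euclidean projection of v onto {x : x_i >= 0, sum(x) = 1} (sort-based method)."""
    u = sorted(v, reverse=True)
    cumulative = 0.0
    theta = 0.0
    for i, ui in enumerate(u, start=1):
        cumulative += ui
        t = (cumulative - 1.0) / i
        if ui - t > 0:
            theta = t  # the last index satisfying the condition fixes the shift
    return [max(x - theta, 0.0) for x in v]


def estimate_mixing_proportions(y, vertices, iterations=2000):
    """Minimize ||y - sum_i gamma_i * vertices[i]||^2 over the simplex.

    Assumes y and the vertices are already whitened (covariance = identity).
    """
    g, p = len(vertices), len(y)
    # The gradient is Lipschitz with constant at most 2 * (sum of squared entries).
    bound = 2.0 * sum(x * x for mu in vertices for x in mu)
    step = 1.0 / bound if bound > 0 else 1.0
    gamma = [1.0 / g] * g
    for _ in range(iterations):
        mix = [sum(gamma[i] * vertices[i][d] for i in range(g)) for d in range(p)]
        resid = [mix[d] - y[d] for d in range(p)]
        grad = [2.0 * sum(vertices[i][d] * resid[d] for d in range(p)) for i in range(g)]
        gamma = project_to_simplex([gamma[i] - step * grad[i] for i in range(g)])
    return gamma


# Three hypothetical "pure" profiles and a test observation that is an exact mixture.
pure = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
print(estimate_mixing_proportions((0.2, 0.3, 0.5), pure))
```

When y lies outside the simplex, the returned proportions are the barycentric coordinates of the closest point on it, so some of them can be exactly zero.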

2.4 Classification with $\mu_1, \ldots, \mu_g$ and $\Sigma$ unknown

In practical situations such as the PCB application
the group population means (pure chromatograms), as well
as $\Sigma$ (the covariance matrix), will usually be unknown.
However, we can still reach conclusions similar to those in
the previous section. That is, suppose Y denotes the
training set mentioned above, representing N total
observations, $n_i$ from the ith group, $N = \sum_{i=1}^{g} n_i$. The existence of
this training set permits the estimation of $\Sigma$ and all the
$\mu_i$. For instance, suppose $\mu_i$ is estimated by $\bar{y}_i$, the sample
mean of the ith group; and $\Sigma$ is estimated by
$S = [1/(N - g)]E$, where E is the within-groups sums-of-
squares and cross-products matrix associated with Y. If
we view the likelihood function as a function of $\gamma$ alone,
we can replace $\Sigma$ by S and $\mu_i$ by $\bar{y}_i$ in the above notation
and immediately reach the conclusion in Theorem 1.

There is an important question left to be answered.
In the practical setting, where does one find a $B$ matrix
satisfying the above requirements? By answering this
question, we will establish a strong connection between the
above ideas and linear discriminant analysis. Consider the
following notation:

$\bar y$ = grand mean $= (1/N)\sum_{i=1}^{g} n_i\,\bar y_i$

$L = \mathrm{span}\{\ldots(\bar y_i - \bar y_1)\}$

$H = \sum_{i=1}^{g} n_i(\bar y_i-\bar y)(\bar y_i-\bar y)^{t}$

$B = (b_1 \cdots b_{g-1})$ = matrix whose columns are the eigenvectors
corresponding to the $(g-1)$ nonzero eigenvalues of …

$B^{*} = \mathrm{span}\{b_1, \ldots\}$

$M$ = matrix whose columns are the $g-1$ eigenvectors corresponding
to the $(g-1)$ nonzero eigenvalues of $E^{-1}H$. That is, $M$ is the
matrix which results from a standard linear discriminant analysis.

$Z$ = matrix of discriminant scores $= YM$


The following theorems, proved in reference 1, establish
the connection.

Theorem  2: 

Af‘=B'£-<'/-l,  and  , 

Theorem  3: 

These results point out that the columns of $B$ are an
orthonormal basis for $L$; also, the transformation resulting
from a linear discriminant analysis has the form
$M^{t} = B^{t}\ldots$ Hence, we arrive at the following
procedure for estimating the unknown mixing proportions:

Step 1 — form the simplex SP, defined by the vertex set
$\{\bar z_1, \ldots, \bar z_g\}$, where $\bar z_i$ is the sample
mean of the $i$th group of discriminant scores.

Step 2 — given that $y$ is a test observation (mixture),
admitting discriminant score $z = M^{t}y$, find the point,
say $\tilde z$, on SP which is closest to $z$.

Step 3 — use the barycentric coordinates of $\tilde z$, given
by $\beta^{t} = (\beta_1, \ldots, \beta_g)$, to estimate the
unknown mixing proportions.

2.4 Location of the closest point

We have shown that the vector of mixing proportions
can be rigorously estimated provided we can find the point
on a given simplex in $R^{g-1}$ closest to a fixed point.
We can adapt a fairly common nonlinear programming
scheme (a "gradient projection" technique) to
solve this problem. So that our general direction is not
lost, we direct the reader to reference 1 for details. The
fact is, the desired closest point can be found in a
(relatively easy) iterative fashion.
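
The closest-point step can be illustrated with a short sketch. This is not the routine of reference 1; it is a generic projected-gradient iteration written under our own assumptions: the vertices are the rows of a hypothetical array `vertices`, the barycentric coordinates are constrained to be nonnegative and sum to one, and the helper names (`project_to_probability_simplex`, `closest_point_on_simplex`) are invented for illustration.

```python
import numpy as np


def project_to_probability_simplex(v):
    """Euclidean projection of v onto {b : b >= 0, sum(b) = 1}."""
    u = np.sort(v)[::-1]                      # sort in decreasing order
    css = np.cumsum(u) - 1.0
    k = np.arange(1, v.size + 1)
    rho = np.nonzero(u - css / k > 0)[0][-1]  # last index kept positive
    theta = css[rho] / (rho + 1)
    return np.maximum(v - theta, 0.0)


def closest_point_on_simplex(vertices, z, n_iter=5000, tol=1e-12):
    """Return (beta, point): barycentric coordinates and closest point.

    vertices : (g, d) array, one simplex vertex per row (assumed layout)
    z        : (d,) array, the discriminant score of the test observation
    """
    V = np.asarray(vertices, dtype=float)
    z = np.asarray(z, dtype=float)
    g = V.shape[0]
    beta = np.full(g, 1.0 / g)                # start at the centroid
    # Step 1/L, where L bounds the curvature of 0.5 * ||V^t beta - z||^2.
    lipschitz = np.linalg.norm(V, 2) ** 2
    step = 1.0 / lipschitz if lipschitz > 0 else 1.0
    for _ in range(n_iter):
        grad = V @ (V.T @ beta - z)
        new_beta = project_to_probability_simplex(beta - step * grad)
        converged = np.linalg.norm(new_beta - beta) < tol
        beta = new_beta
        if converged:
            break
    return beta, V.T @ beta


if __name__ == "__main__":
    triangle = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
    beta, point = closest_point_on_simplex(triangle, [2.0, 2.0])
    print(beta, point)
```

For a score inside the simplex the returned coordinates are simply its barycentric coordinates; for a score outside, some coordinates are driven to zero, which corresponds to estimating that those groups are absent from the mixture.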

3. RESULTS OF PCB APPLICATION

The training set $Y$ is not appropriate for a
discriminant analysis as it stands, because the column
dimension is too large. We therefore used a principal
component analysis as a preliminary step to reduce the
column dimension. Then, we performed a linear discriminant
analysis to obtain a matrix of discriminant scores
$Z$. From this we formed the vertex vectors of the
simplex SP by calculating group means.

For use in testing the effectiveness of our model, we
had access to a matrix consisting of several runs on the
same three-component mixture. Using methods of
gravimetric measuring, pure samples of Aroclors 1, 6 and 7
were weighed in the relative proportions of 2.5:2:1
(respectively), and then mixed. That is, the mixture theoretically
consisted of 45.5% Aroclor 1, 36.3% Aroclor 6, and 18.2%
Aroclor 7. Using methods of high resolution gas
chromatography, 38 runs were made on this mixture, and the
same 93 variates as in the training set were isolated.

These  pseudo-unknowns  were  treated  as  test  observa¬ 
tions  and  the  38  corresponding  discriminant  scores  were 
obtained.  These  38  points  in  R*  were  then  identified  with 
the  corresponding  38  closest  points  on  SP .  The  barycen- 
tric  coordinates  of  these  latter  points  were  calculated  with 
respect  to  SP  and  used  as  estimates  of  the  mixing  propor¬ 
tions.  The  classification  results  from  our  model  are  shown 
in  Table  1. 

The following comments highlight the important
features of these results:

—  The  Aroclors  actually  present  in  the  mixture 
(numbers  1,  6,  and  7)  were  identified  correctly 
(i.e., had a positive barycentric coordinate) in
every  case  except  one  (run  number  5). 

—  Exactly  these  three  Aroclors  were  identified  in 
12  of  the  38  runs.  Of  the  remaining  26  cases, 
the estimated contributions from Aroclors other
than 1, 6, or 7 totaled less than 1 percent in 9
cases. (Except for run number 14, these small
contributions were always associated with a single
Aroclor, namely Aroclor 8.) Thus, in 32 of
the 38 cases, the contributions of Aroclors 1, 6,
and  7  are  estimated  to  exceed  95  percent  of  the 
complete  mixture. 

— Examination of the pattern of the ESS measure
reveals that the first 31 cases (i.e., rows in the
table) are all relatively similar in terms of the
accuracy achieved; the last 7 cases exhibit
substantially higher values of the ESS. This
suggests that the concentration data for these 7
runs differs in some significant respect from the
other runs. The variability of the classification
results among these 7 runs (last seven rows)
suggests that several different types of anomalies
may be present in these runs, as opposed to a
single type. Five of these cases correspond to
runs made late in the study, as identified by the
run number. In fact, five of the last six tripartite
runs fall into this group of seven (runs …,
31, 36, 37, …). This suggests that some type(s)
of instrument degradation and/or
contamination may be responsible for the poorer
performance on these runs.

— Among the first 31 cases shown in the table, the
estimated  percentages  are  quite  consistent,  as 
summarized  in  Table  2. 

—  Despite  the  consistency  of  the  estimates  evi¬ 
denced  in  Table  2,  it  is  clear  that  there  is  a 
discrepancy  of  about  eight  percentage  points 
between the estimates of Aroclor 1 and the
gravimetric weight. The Aroclor 7 estimates admit
a similar discrepancy. The source of the bias is
not clear, but the consistency of the estimates is
encouraging. It suggests that accurate results
may be obtainable by adjusting for the bias.
Further study is needed to resolve this matter.

4.  CONCLUSIONS 

The  results  presented  in  section  3  clearly  support  the 
potential  worth  of  our  model.  Classical  discriminant  tech¬ 
niques  are  principally  concerned  with  the  classification  of 
a test observation into one group. Many recent
methods (e.g., SIMCA and "classification trees") are
also directed to this purpose. In these methods statements
are available concerning the probability of membership in
a certain class. However, it is unclear how to translate
this uncertainty into a statement concerning mixing
proportions. The model we have developed is more focused
and  presents  easily  interpreted  results. 

Well-known  constrained  least-squares  techniques  are 
certainly  applicable  to  the  problem  we  have  addressed.  In 
fact, we are essentially performing a least-squares analysis
once the discriminant scores are obtained. However, the
use of discriminant analysis to initially "best separate"
the groups is found to be a useful and intuitive step.

5.  COMPUTER  ROUTINE 

Our model employed computer routines programmed
in SAS and executed under SAS Release 82 at Triangle
Universities Computation Center (TUCC), Research
Triangle Park, N.C. The routines are extremely flexible and
essentially allow the entire procedure outlined in section 2
to be automatically performed, including the discriminant
analysis and a principal component analysis, if necessary
or desired. Further, several refinements and methodological
extensions not mentioned in this article are available
as options.
TABLE 1 — Classification Results for 38 Runs on the
Three Component Mixture

TABLE 2 — Summary of Consistency Among the
First 31 Runs Listed in Table 1
BIAS OF ANIMAL POPULATION TREND ESTIMATES


Paul H. Geissler and William A. Link, U.S. Fish and Wildlife Service, Patuxent Wildlife Research Center


Surveys  of  animal  calls  or  signs  are  often  used  to  monitor 
population  levels  (Seber  1982,  Ralph  and  Scott  1981).  For 
example,  the  Mourning  Dove  Call-Count  Survey  (Dolton  1977) 
is  a  stratified  random  survey  with  more  than  1000  routes.  Each 
spring biologists count the number of doves they hear calling
under  standardized  conditions  at  20  stops  along  each  route. 
The  routes  are  used  each  year  without  drawing  a  new  sample. 

Biologists  want  to  know  if  the  animal  population  is 
increasing  or  decreasing  over  some  period.  Annual  mean  counts 
per  route  cannot  be  used  because  changes  in  routes  and 
observers  affect  the  counts.  Instead,  the  slope  of  a  regression 
line  is  used  to  estimate  the  average  trend  on  each  route  over  the 
period of interest and to predict counts in years y+1 and y.
These  trends  are  used  to  estimate  the  ratio  of  the  populations 
in  those  years  (Geissler  1984). 

The call-counts on a route can be modeled as

$$c_{sriy} = o_{sri}\,\tau_{sr}^{\,y}\,\epsilon_{sriy} \qquad (1)$$

where $c_{sriy}$ = observed call-count, s = stratum, r = route,
i = observer, y = year, $o_{sri}$ = observer effect, $\tau_{sr}$ = the trend,
and $\epsilon_{sriy}$ = error term. Taking logarithms, (1) becomes a
linear regression,

$$c''_{sriy} = o''_{sri} + \tau''_{sr}\,y + \epsilon''_{sriy} \qquad (2)$$

where $c''_{sriy} = \ln(c_{sriy}+0.5)$. Quantities on the logarithmic
scale are indicated by a double prime to distinguish them from
the corresponding quantities on the arithmetic scale. Because
the logarithm of zero is not defined, an arbitrary positive
constant is added to $c_{sriy}$ (0.5 is halfway between zero and the
lowest observable count).

The adjusted predicted count in year y is

$$\hat c''_{sry} = \bar o''_{sr} + \hat\tau''_{sr}\,y \qquad (3)$$

where $\bar o''_{sr}$ is the mean of the estimated observer effects on the
route r (see Searle 1971 Ch. 5, Searle et al. 1980). Ordinary
linear regression provides the best linear unbiased estimate of
$\hat c''_{sry}$ from (3). Suppose that $\theta''$ is an estimable linear
combination of the regression parameters on the logarithmic
scale. Under the assumption that $c''_{sriy}$ is normally distributed
(counts are lognormally distributed), Bradu and Mundlak
(1970) have shown that the uniformly minimum variance
unbiased estimate (UMVUE) of $\psi = \exp(\theta'')$ is

$$\hat\psi = T(\hat\theta'') = \exp(\hat\theta'')\,g_f(-v\,\ldots) \qquad (4)$$

where $v$ = …, $f$ = error degrees of freedom, and the
function $g_f$ is defined by

$$g_f(t) = \sum_{p=0}^{\infty} \frac{f^{p}\,(f+2p)}{f\,(f+2)\cdots(f+2p)} \left(\frac{f}{f+1}\right)^{p} \frac{t^{p}}{p!} .$$

In particular, $\hat c_{sry} = T(\hat c''_{sry})$.

The population trend can now be estimated as the ratio of
the total populations in years y+1 and y based on a sample of
$n_s$ out of the $N_s$ sampling units in stratum s. Assuming that
the predicted count $\hat c_{sry}$ is proportional to the local population,
$P_{sry} = a\,k\,\hat c_{sry}$, where $a$ is the area of the sampling unit and $k$
is a proportionality constant,

$$r = \frac{\sum_s \sum_r P_{sr(y+1)}}{\sum_s \sum_r P_{sry}} .$$

$$\text{Thus}\quad \hat r = \frac{\sum_s N_s \sum_r a\,k\,\hat c_{sr(y+1)} / n_s}{\sum_s N_s \sum_r a\,k\,\hat c_{sry} / n_s} = \frac{\sum_s A_s \sum_r \hat c_{sr(y+1)} / n_s}{\sum_s A_s \sum_r \hat c_{sry} / n_s} . \qquad (5)$$

The back transformed adjusted counts $\hat c_{sry}$ from (3) and (4) are
weighted by the strata areas $A_s$, where $A_s = a\,N_s$. They may
also be weighted by the inverse of the relative variance to
increase the precision of the trend estimate, giving more weight
to routes that have the smallest relative variance:

$$\hat r = \frac{\sum_s A_s \left(\sum_r w_{sr}\,\hat c_{sr(y+1)}\right) / \sum_r w_{sr}}{\sum_s A_s \left(\sum_r w_{sr}\,\hat c_{sry}\right) / \sum_r w_{sr}} . \qquad (6)$$
Here the weight $w_{sry} = E^{2}(\hat c''_{sry}) / V(\hat c''_{sry})$. The weights for the
mid year are used for both the numerator and denominator.
Note that the weight depends only on known values (route,
year, and observer) and not on the variance that is estimated
poorly.

The  route  is  the  only  randomly  selected  element  in  the 
sampling  design.  Counts  are  repeatedly  made  on  the  same 
routes  without  selecting  a  new  sample.  Therefore  variances 
should  be  calculated  among  routes  rather  than  among  years. 

Bootstrap confidence intervals (Efron 1982) are estimated for
the trend estimates (6). The parameters (…) are estimated
for each route. A large number, B, of bootstrap samples each
with $n_s$ routes are selected with replacement from the $n_s$ routes
in each stratum and B bootstrap replicate estimates are made
for a state or management unit using the parameter estimates
for the selected routes. The state or management unit trend is
estimated as $\bar r$ and the 100α percent confidence interval is
estimated as $\bar r \pm t\,\hat\sigma$, where $\bar r$ and $\hat\sigma$ are the mean and
standard deviation of the bootstrap samples, and where $t$ =
tabulated t value, $n = \sum n_s$ is the total number of routes, and
L is the number of strata. The bootstrap trend estimate $\bar r$ is
reported to reduce the bias of a ratio from order $n^{-1}$ to order
$n^{-2}$ (Efron 1982). B=200 bootstrap replications is
recommended to give an adequate approximation.
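
The resampling scheme just described can be sketched in a few lines. The sketch below is an illustration under stated assumptions, not the authors' GAUSS program: each stratum is a hypothetical pair `(area, routes)`, each route is a pair of back-transformed predicted counts for years y and y+1, the unweighted ratio (5) is used as the trend estimate, and the function names are invented.

```python
import random
import statistics


def trend_ratio(strata):
    """Ratio estimate in the spirit of (5).

    strata : list of (area, routes); routes is a list of
             (count_year_y, count_year_y_plus_1) pairs (assumed layout).
    """
    num = sum(area * sum(c1 for _, c1 in routes) / len(routes)
              for area, routes in strata)
    den = sum(area * sum(c0 for c0, _ in routes) / len(routes)
              for area, routes in strata)
    return num / den


def bootstrap_trend(strata, n_boot=200, seed=0):
    """Resample routes with replacement within each stratum.

    Returns the mean and standard deviation of the replicate trends.
    """
    rng = random.Random(seed)
    replicates = []
    for _ in range(n_boot):
        resampled = [(area, rng.choices(routes, k=len(routes)))
                     for area, routes in strata]
        replicates.append(trend_ratio(resampled))
    return statistics.mean(replicates), statistics.stdev(replicates)


if __name__ == "__main__":
    example = [
        (100.0, [(18.0, 19.5), (22.0, 21.0), (25.0, 27.0), (15.0, 16.5)]),
        (250.0, [(8.0, 8.5), (12.0, 12.0), (10.0, 11.0)]),
    ]
    r_bar, sigma_hat = bootstrap_trend(example)
    print(r_bar, sigma_hat)
```

A confidence interval then follows as the replicate mean plus or minus a tabulated t value times the replicate standard deviation. Routes, not years, are resampled because the route is the only randomly selected element of the design.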

In this paper we examine the accuracy and precision of the
trend estimator. In the first section, we report the results of
simulations which investigate the performance of the estimator
under a variety of conditions. The second section investigates
the performance of alternative estimators based on reduced
Mean Squared Error (MSE) estimators. These alternative
estimators are investigated because of the inadmissibility of the
Bradu and Mundlak UMVUE (see, for example, Rukhin 1986).

We  thank  C.  Bunck,  J.  Hatfield,  C.  McCulloch,  and  N.  Coon 
for  reviewing  this  manuscript. 

I.  SIMULATION 

A factorial simulation experiment was performed to examine
the bias and precision of alternate estimators using the GAUSS
programming language (Edlefsen and Jones 1986). The factors
(levels) were distributions (3 lognormals, Poisson, negative
binomial), trends (0.95, 1.00, 1.05), years (3, 5, 10), routes (10,
100), observers in model (yes, no), trend definitions
[$P_{(y+1)}/P_y$, $P_{(y+0.5)}/P_{(y-0.5)}$], and adjustment (no
adjustment, mean of bootstrap replicates, median of bootstrap
replicates). The estimators were developed using the lognormal
model, but this distribution gives continuous counts without
any zeros. Poisson and negative binomial counts test the trend
estimation with more realistic discrete data with zero counts.
The trends represent a stable, an increasing and a decreasing
population of birds. Trends are estimated over several periods
of time, ranging from 2 to 25 years. Two year trends are not
included because a variance for the back transformation cannot
be estimated. Varying the number of routes checks to see if
bias is reduced with increased sample size. Observer effects
may be very important, but including observers in the model
may result in overparameterization. The trend r is defined as
the population in year (y+1) divided by the population in year
y [$P_{(y+1)}/P_y$]. Alternatively it could be defined as the
population in year (y+0.5) divided by the population in year
(y-0.5) [$P_{(y+0.5)}/P_{(y-0.5)}$]. The alternate definition is
centered on the period of interest but introduces another
parameter into the denominator of the ratio ($\hat c''_{sr(y+0.5)}$ = … and
$\hat c''_{sr(y-0.5)}$ = …). The effectiveness of the bootstrap
bias adjustment is evaluated. The mean of the bootstrap
replicates reduces the bias from order $n^{-1}$ to order $n^{-2}$. If the
bootstrap empirical distribution function formed by the
replicates is asymmetrical, the median may be a better
estimator.

For the lognormal simulations, log counts were sampled from
a normal distribution with mean c'' of 3.0 and standard
deviations s'' of 0.1, 0.5, and 1.0. Mourning dove call-counts
have a mean of about 20 birds per route (ln(20) ≈ 3) and the
log transformed residuals have a standard deviation of about
0.5. Poisson counts with a mean of 2 birds per route and
negative binomial counts with a mean of 0.3 birds per route
and shape parameter k=0.5 were also used. We have found
that American woodcock counts in low density areas are
approximately distributed according to that negative binomial
distribution and represent an extreme situation. A constant
(0.5) was added to the Poisson and negative binomial counts so
that zero counts could be log transformed (zero counts cannot
occur with lognormal counts).

The specified mean count $\bar c$ corresponds to the mean year $\bar y$.
Years were coded so that $\bar y = 0$. The means for other years y
were $\bar c\,r^{y}$, where r is the trend (0.95, 1.00, or 1.05). Each year
there was a 0.2 probability that the observer changed on the
route, though all simulated observer effects were identical.
These effects were included to assess the consequence of
estimating these additional parameters which are often required
in practice. In the analysis, bias adjustments (none, mean,
median) are repeated measures because they result from the
same simulated data. With 10 routes, 2,000 replications were
run for each case, but with 100 routes, 500 replications per case
were used.

The results of an analysis of variance of the factorial
simulation experiment had many significant interactions. Only
the bias adjustment and estimator effects are under the control
of the investigator. The [$P_{(y+1)}/P_y$] estimator was uniformly
less biased than the [$P_{(y+0.5)}/P_{(y-0.5)}$] estimator and will be
adopted. Both bias adjustments had small effects; neither was
consistently superior. Mean biases were 0.00304 with the
median adjustment, 0.00510 with the mean adjustment, and
0.00580 without adjustment. Because these differences were
small and inconsistent, the common practice of using the mean
adjustment will be adopted.

Because of the significant interactions, each count distribution
was analyzed separately (Tables 1 and 2). Biases for the lognormal
distribution with s''=0.1 were negligible. With s''=0.5 and 1.0,
fitting observer effects with 3 or 5 years is not advisable
because of the large biases. There may be too few degrees of
freedom to obtain a stable variance estimate for the Bradu and
Mundlak (1970) backtransformation. Otherwise, the biases
seem to be acceptably small. The same recommendations apply
to the Poisson distribution with the addition that 3 years may
result in unacceptable biases. Ten year trends and 5 year
trends without observer effects have acceptable biases. The
negative binomial distribution represents an extreme situation
with a mean count of 0.3 birds per route. Biases for that
distribution are unacceptable. Adding 0.5 to data sets with
numerous zeros biases the trends towards 1.0.
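
The pull towards 1.0 caused by the added constant can be seen even without sampling noise. The following sketch is our own illustration, not part of the simulation experiment: it fits an ordinary least-squares line to ln(expected count + constant) for noise-free expected counts and back-transforms the slope. The function name and the choice of five centred years are assumptions.

```python
import math


def fitted_trend(mean_count, trend, years, constant=0.5):
    """Back-transformed OLS slope of ln(expected count + constant) on year.

    Years are coded so that the mean year is 0, and the expected count
    in year y is mean_count * trend ** y (no sampling noise is added).
    """
    ys = [y - (years - 1) / 2.0 for y in range(years)]
    logs = [math.log(mean_count * trend ** y + constant) for y in ys]
    y_bar = sum(ys) / years
    l_bar = sum(logs) / years
    sxy = sum((y - y_bar) * (l - l_bar) for y, l in zip(ys, logs))
    sxx = sum((y - y_bar) ** 2 for y in ys)
    return math.exp(sxy / sxx)


if __name__ == "__main__":
    for mean_count in (20.0, 2.0, 0.3):
        print(mean_count, fitted_trend(mean_count, 1.05, 5))
```

With a mean of 20 birds the fitted trend stays close to the true 1.05, while with a mean of 0.3 birds it falls well below 1.05 and towards 1.0, in line with the direction of the negative binomial biases reported here.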

The standard errors of the trend estimates are given in Table
3. Increasing the variance of the counts of course increased the
standard error of the estimate. Increasing the number of years
or routes reduced the standard errors of the estimates, as did not
fitting observer effects, which increased the effective number of
years.


II. REDUCED MSE ESTIMATION OF $e^{\mu}$

Considerable attention has focused on the inadmissibility of
Bradu and Mundlak's (1970) estimator (Teekens and Koerts
1972, Evans and Shaban 1974, Rukhin 1986). Inadmissibility
results from the possibility that $\exp(\bar y)\,g_m(-.5s^{2})$, which
estimates a nonnegative quantity, can be negative. It is
therefore possible to construct estimators which, though biased,
have smaller mean squared errors (MSE) than Bradu and
Mundlak's UMVUE.
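
A small numerical sketch shows how the back-transformation factor can change sign. It assumes the standard Bradu and Mundlak series $g_f(t) = \sum_{p} \frac{f^{p}(f+2p)}{f(f+2)\cdots(f+2p)} \left(\frac{f}{f+1}\right)^{p} \frac{t^{p}}{p!}$; the function name `g` and the truncation rule are our own choices.

```python
import math


def g(f, t, tol=1e-16, max_terms=500):
    """Series g_f(t), summed term by term.

    Successive terms satisfy
        term[p+1] = term[p] * f/(f + 2p) * f/(f + 1) * t/(p + 1),
    starting from term[0] = 1.
    """
    total = 1.0
    term = 1.0
    for p in range(max_terms):
        term *= (f / (f + 2.0 * p)) * (f / (f + 1.0)) * t / (p + 1)
        total += term
        if abs(term) < tol:
            break
    return total


if __name__ == "__main__":
    s = 3.0
    # With one error degree of freedom the factor equals cos(s / sqrt(2)),
    # which is negative for this value of s.
    print(g(1, -0.5 * s ** 2), math.cos(s / math.sqrt(2.0)))
```

With one degree of freedom the series reduces to a cosine for negative arguments, so a large enough sample standard deviation makes the estimate of a nonnegative quantity negative. For a large number of degrees of freedom the series approaches exp(t) and the problem disappears.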

Teekens and Koerts (1972) demonstrated that for any given
m there exist negative values of t for which $g_m(t)$ is negative.
As an illustration of this, consider the case m=1, which arises
in the problem under consideration when trends are being
estimated for a three year period. It is easily verified by
consideration of the Taylor series for the cosine, that for t < 0,
$g_1(t) = \cos(\sqrt{-t})$. Thus

$$
\begin{aligned}
P\bigl(\exp(\bar y)\,g_1(-.5s^{2}) < 0\bigr) &= P\bigl(\cos(s/\sqrt{2}) < 0\bigr) \\
&= P\bigl((4j+1)\tfrac{\pi}{2} < s/\sqrt{2} < (4j+3)\tfrac{\pi}{2};\ j \in \{0, 1, 2, \ldots\}\bigr), \qquad (7)
\end{aligned}
$$

so that, since $s^{2}/\sigma^{2}$ has a chi-square distribution with one
degree of freedom and cumulative distribution denoted by
$\chi_1(\cdot)$, we find that the probability of a negative estimate is

$$\sum_{j=0}^{\infty} \left\{ \chi_1\!\left[\frac{(4j+3)^{2}\pi^{2}}{2\sigma^{2}}\right] - \chi_1\!\left[\frac{(4j+1)^{2}\pi^{2}}{2\sigma^{2}}\right] \right\} .$$


Some values of this are given in the following table. For small
values of σ the probability of a negative estimate is not too
large, but as σ increases, the probability approaches 1/2.
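
The probability of a negative estimate can be evaluated directly. The sketch below is an illustration under the one-degree-of-freedom assumption used in (7): it uses the identity that a chi-square distribution function with one degree of freedom at x equals erf(sqrt(x/2)), so each interval in (7) contributes a difference of two error functions. The function name and stopping rule are assumptions.

```python
import math


def prob_negative_estimate(sigma, max_terms=100000):
    """P(cos(s / sqrt(2)) < 0) when s**2 / sigma**2 is chi-square, 1 d.f.

    The j-th interval contributes
        erf((4j + 3) * pi / (2 * sigma)) - erf((4j + 1) * pi / (2 * sigma)).
    """
    total = 0.0
    for j in range(max_terms):
        lower = (4 * j + 1) * math.pi / (2.0 * sigma)
        upper = (4 * j + 3) * math.pi / (2.0 * sigma)
        total += math.erf(upper) - math.erf(lower)
        if lower > 10.0:  # remaining terms are numerically zero
            break
    return total


if __name__ == "__main__":
    for sigma in (0.5, 1.0, 2.0, 5.0, 50.0):
        print(sigma, prob_negative_estimate(sigma))
```

The computed values are near zero for small σ and move towards 1/2 as σ grows, which is the behaviour described in the text.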
… might be attributable to large variability in the
denominator, we considered alternative estimators which
utilized reduced MSE estimators in place of the UMVUE.


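Let X and Y be independent random variables, $X \sim N(\mu, \sigma^{2})$.
We consider estimators of the form

$$T(X,Y) = e^{X}\,Y \qquad (8)$$

for the parameter $\theta = e^{\mu}$. Evans and Shaban (1976) express
the MSE of estimators of this form by

$$
\begin{aligned}
\mathrm{MSE}\bigl(T(X,Y)\bigr) &= E\bigl[(e^{X}Y - e^{\mu})^{2}\bigr] \\
&= E(e^{2X})\,E\!\left[\left(Y - \frac{e^{\mu}E(e^{X})}{E(e^{2X})}\right)^{2}\right] + e^{2\mu}\left[1 - \frac{E(e^{X})^{2}}{E(e^{2X})}\right] . \qquad (9)
\end{aligned}
$$

This latter equality is established by first expanding the
binomial and then completing the square. Since

$$E(e^{aX}) = \exp\!\left(a\mu + \tfrac{1}{2}a^{2}\sigma^{2}\right),$$

we have

$$\frac{e^{\mu}E(e^{X})}{E(e^{2X})} = \exp(-1.5\,\sigma^{2}),$$

which would suggest that the choice of Y to minimize (9)
should be an estimator of $\exp(-1.5\,\sigma^{2})$.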

An estimator of the form $\exp(\bar y - 1.5s^{2})$ is therefore
attractive on two counts: 1) it has the potential of reducing
MSE, and 2) it is non-negative. In order to evaluate the
performance of such estimators, simulations of the trend
estimator with 3, 5, and 7 years and 3, 5, and 7 routes were
performed. The logarithms of the counts were normally
distributed with standard deviations of 0.75, 1.25, and 1.75,
and mean of zero. A stable population was simulated.
Observer effects were not included in the simulation. Each
combination was replicated 200 times. The results are
summarized in Table 4.

As previously indicated, the bias of the estimator based on
the UMVUE increased with the underlying standard deviation.
This appears to be the case as well with the estimator based on
the reduced MSE version, though the bias in the latter case is
negative. Also, in the latter case, the magnitude of the bias did
not decrease with increased number of routes or years; in fact,
it appears to increase. While the standard deviation of the
second trend estimator was generally smaller (not included in
the table), the decrease in size could not offset the sizable bias.
For these reasons we do not recommend the reduced MSE
estimator.

III. CONCLUSIONS

♦The $P_{(y+1)}/P_y$ trend estimator should be used instead of
the $P_{(y+0.5)}/P_{(y-0.5)}$ estimator because it is less biased,
although it has the disadvantage of not being centered on
the interval.

♦The effect of bootstrap bias adjustment was small and not
consistent.

♦The bias increased with an increase in the variance of the
distribution.

♦With lognormal [c''=3 (c≈20 birds), s''≥0.5] and Poisson
(c=2 birds) counts, fitting observer effects with less than 5
annual observations is not recommended because of the
bias. If observer effects are believed to be important,
trends should not be estimated.


♦With  Poisson  (c=2  birds)  counts,  it  is  not  advisable  to  fit 
trends  with  less  than  5  annual  observations. 

♦Negative  binomial  (c=0.3  birds,  k=0.5)  counts  represent  an 
extreme  situation  where  trend  estimates  are  not  advised. 

♦Otherwise,  the  bias  of  estimated  trends  seems  to  be  acceptably 
small  in  the  situations  considered. 

♦Adding a constant to the Poisson and negative binomial
counts biased the trend estimates towards one but is
necessary because the logarithm of zero is not defined.

♦The reduced MSE backtransformation is not recommended
because the bias does not decrease with an increase in
sample size.

REFERENCES 

Bradu, D. and Y. Mundlak. (1970), "Estimation in lognormal
linear models," J. Am. Stat. Assoc. 65:198-211.

Dolton, D. D. (1977), Mourning dove status report, 1976.
Spec. Sci. Rep.-Wildl. No. 208. U. S. Fish & Wildl. Serv.,
Washington, D. C. 27 p.

Edlefsen, L. E. and S. D. Jones. (1986), GAUSS programming
language manual. Aptech Systems, Inc., Kent, WA. 466 p.

Efron, B. (1982), The jackknife, the bootstrap and other
resampling plans. Society for Industrial and Applied
Mathematics, Philadelphia. 92 p.

Evans, I.G., and S.A. Shaban. (1974), "A note on estimation
in lognormal models," J. Am. Stat. Assoc. 64:632-636.

Geissler, P. H. (1984), "Estimation of animal population
trends and annual indices from a survey of call-counts or
other indications," Proceedings of the American Statistical
Association, Section on Survey Research Methods p. 472-
477.

Ralph, C. J., and J. M. Scott (eds.) (1981), Estimating
numbers of terrestrial birds. Studies in Avian Biology No.
6. Cooper Ornithological Society, Lawrence, KS. 630 p.

Rukhin, A.L. (1986), "Improved estimation in lognormal
models," J. Am. Stat. Assoc. 81:1046-1049.

Searle, S. R. (1971), Linear models. Wiley, New York. 532 p.

Searle, S. R., F. M. Speed, and G. A. Milliken. (1980),
"Population marginal means in the linear model: an
alternative to least squares means," American Statistician
34:216-221.

Seber, G. A. F. (1982), The estimation of animal abundance
and related parameters. Macmillan, New York. 651 p.

Teekens, R., and J. Koerts. (1972), "Some statistical
implications of the log transformation of multiplicative
models," Econometrica 40:793-819.


Table 1. Significance levels (P) for effects in separate analyses
of variances of biases in simulation experiment for each count
distribution. Values of P<0.05 and P<0.01 are flagged by +
and *, respectively.


Effort. 

I. nor  0. 1 

Lnor  0.5 

I.iior  1.0 

Poisson 

N.llinom. 

Troncl 

(1.012.!  + 

0.082.') 

0.0210  +  0.9721 

0.0001 

♦ 

^'oars 

0.0702 

fl.S.M.I 

n.2:!7.| 

0.0007  . 

0.0001 

* 

Houloa 

0.6291 

0.7192 

o.r>oi 

n.,6r>'.)-i 

fl.:i,6.68 

OI>s. 

O.l.VJl 

0.2708 

0.0172  + 

0. 1  :i29 

0.000.6 

* 

1  X  Y 

O.IMOS  + 

n.O.'l.'iti  + 

0,0218  +  0.1-182 

0.0106  + 

1  X  U 

O..|fl01 

fl.O.ISS 

0.97.H 

0 

0.9600 

T  X  (J 

0.08:16 

0.02.67  + 

O.O.'l-ll  +  0.00,61  . 

0.0161 

+ 

Y  X  II 

0.7876 

0.8926 

0.6181 

0,6'206 

0.1)006 

Y  X  O 

O.IOSS 

0.969.6 

0.2267 

n.2,6'20 

0.0262 

H  X  O 

0.6291 

0.667.'! 

0.6101 

0.7211 

0.6791 

Table 2. Mean biases of trend estimates from simulation
experiment. Effects of the number of routes are not included in
this table because they were not significant (Table 1).

Lognormal, log mean=3.0 (≈20 birds), log standard deviation = 0.1

                      years(observer)
trend      3(y)      3(n)      5(y)      5(n)     10(y)     10(n)
0.95    -0.0010   -0.0005    0.0000    0.0000    0.0000    0.0000
1.00     0.0010    0.0000    0.0000    0.0000    0.0000    0.0000
1.05    -0.0010    0.0000    0.0000    0.0000    0.0000    0.0000

Lognormal, log mean=3.0 (20 birds), log standard deviation = 0.5

                      years(observer)
trend      3(y)      3(n)      5(y)      5(n)     10(y)     10(n)
0.95    -0.0120    0.0015    0.0085    0.0010    0.0040    0.0000
1.00     0.0345    0.0000    0.0080   -0.0010    0.0010    0.0000
1.05    -0.0060    0.0020    0.0000    0.0025    0.0020    0.0000

Lognormal, log mean=3.0 (20 birds), log standard deviation = 1.0

                      years(observer)
trend      3(y)      3(n)      5(y)      5(n)     10(y)     10(n)
0.95    -0.0010    0.0060    0.0365   -0.0015    0.0115    0.0005
1.00     0.1920    0.0095    0.0440    0.0015    0.0020   -0.0005
1.05     0.0065    0.0025    0.0030    0.0005    0.0105    0.0005

Poisson, mean=2.0 (birds)

                      years(observer)
trend      3(y)      3(n)      5(y)      5(n)     10(y)     10(n)
0.95    -0.0380   -0.0090   -0.0225    0.0065    0.0030    0.0060
1.00    -0.0270   -0.0150   -0.0050   -0.0025   -0.0025   -0.0005
1.05     0.0163 [?]0.0290   -0.0115   -0.0080   -0.0060   -0.0000

Negative Binomial, mean=0.3 (birds)

                      years(observer)
trend      3(y)      3(n)      5(y)      5(n)     10(y)     10(n)
0.95     0.0680    0.0570    0.0440    0.0065    0.0445    0.0370
1.00     0.0740    0.0285    0.0305    0.0030   -0.0020    0.0005
1.05    -0.0005   -0.0110   -0.0375   -0.0330   -0.0305   -0.0375

Tabic  3.  Standard  Errors  of  trend  estiniates  from  simulation 
experiment. 


Lognormal
log mean=3.0 (20 birds), log standard deviation = 0.1

                         years(observer)
trend     3(y)     3(n)     5(y)     5(n)    10(y)    10(n)
0.95    0.0000   0.0005   0.0000   0.0000   0.0000   0.0000
1.00    0.0000   0.0000   0.0000   0.0000   0.0000   0.0000
1.05    0.0000   0.0000   0.0000   0.0000   0.0000   0.0000

Lognormal
log mean=3.0 (20 birds), log standard deviation = 0.5

                         years(observer)
trend     3(y)     3(n)     5(y)     5(n)    10(y)    10(n)
0.95    0.0020   0.0015   0.0005   0.0000   0.0000   0.0000
1.00    0.0005   0.0030   0.0010   0.0000   0.0000   0.0000
1.05    0.0000   0.0010   0.0020   0.0015   0.0000   0.0000

Lognormal
log mean=3.0 (20 birds), log standard deviation = 1.0

                         years(observer)
trend     3(y)     3(n)     5(y)     5(n)    10(y)    10(n)
0.95    0.0130   0.0020   0.0015   0.0005   0.0015   0.0005
1.00    0.0460   0.0015   0.0400   0.0015   0.0010   0.0005
1.05    0.0085   0.0005   0.0080   0.0015   0.0005   0.0005

Poisson
mean=2.0 (birds)

                         years(observer)
trend     3(y)     3(n)     5(y)     5(n)    10(y)    10(n)
0.95    0.0030   0.0010   0.0005   0.0005   0.0010   0.0000
1.00    0.0000   0.0020   0.0010   0.0005   0.0005   0.0005
1.05    0.0015   0.0030   0.0005   0.0010   0.0000   0.0000

Negative Binomial
mean=0.3 (birds)

                         years(observer)
trend     3(y)     3(n)     5(y)     5(n)    10(y)    10(n)
0.95    0.0040   0.0010   0.0010   0.0005   0.0005   0.0000
1.00    0.0020   0.0025   0.0085   0.0000   0.0010   0.0005
1.05    0.0065   0.0010   0.0005   0.0010   0.0005   0.0005
Table 4. Bias of trend estimators using UMVUE and reduced MSE estimators of e^*.

                        UMVUE             REDUCED MSE
Years  Routes  SJT    Bias  SE[Bias]     Bias  SE[Bias]

3 

0.75 

0.103 

0.029 

-0.084 

0.029 

1.25 

0.096 

0.065 

-0.108 

0.056 

1.75 

5 

0.75 

0.032 

0.018 

-0.121 

0.019 

1.25 

0.011

0.051 

1.75 

0.329 

0.141 

0.344

7 

0.75 

0.083 

0.017 

0.017 

1.25 

0.018 

1.75 

0.629 

0.215 

0.508

0.145 

5 

3 

0.75 

0.010 

0.010 

1.25 

0.022 

0.021 

-0.092 

0.020 

1.75 

0.029 

-0.168 

0.029 

5 

0.75 

0.012 

0.008 

-0.042 

0.008 

1.25 

-0.006 

0.016 

-0.111 

0.015 

1.75 

0.060 

0.028 

0.030 

7 

0.75 

0.002 

0.007 

-0.049 

0.007 

1.25 

0.038 

0.014 

-0.080 

0.013 

1.75 

0.010 

0.021 

-0.138 

0.021 

7 

3 

0.75 

-0.003

0.006 

-0.023 

0.006 

1.25 

0.004 

0.011 

-0.046 

0.011 

1.75 

0.003 

0.015 

-0.087 

0.014 

5 

0.75 

-0.003 

0.005 

-0.024 

0.005 

1.25 

0.009 

0.008 

-0.040 

0.008 

1.75 

0.013 

-0.062 

0.013 

7 

0.75 

-0.005 

HiliS  M 

0.004 

1.25 

0.015 

0.007 

1.75 

0.019 

0.011 
The Elimination of Quantization Bias using Dither
Douglas M. Droher and Martin J. Garbo, Hughes Aircraft Company


I. Introduction

This paper presents a method for recovering the
decimal precision of a non-observable variable that
has been quantized. The technique involves adding a
random variate (dither) from a uniform distribution to
the variable prior to quantization. It then shows the
conditions under which the expectation of the dithered
quantization function equals the value of the variable
in question. An expression for the variance of the
dithered quantization function is also derived. The
results are generalized to the multiple-quantization
case. Examples are presented which show the
application of this technique to reduce the magnitude
of bias error caused by roundoff.

II. Methodology

Suppose that it is desired to estimate the actual
value of a variable, F, when it can be observed only
after being quantized. Specifically, if q is the
quantization interval, or distance between successive
quantum levels, then F may be represented as

$$F = nq + \varepsilon \qquad (1)$$

where n is an integer and $|\varepsilon| < q$. If the quantization
interval is 1, then $\varepsilon$ is the fractional part of F. Thus,
the problem can be reduced with no loss of generality
to that of estimating $\varepsilon$ given its quantized value, $Q(\varepsilon)$.

Assume that the non-observable variable $\varepsilon$ is
quantized as follows:

$$Q(\varepsilon) = \mathrm{INT}[\varepsilon + 0.5] \qquad (2)$$

where $Q(\varepsilon)$ is the quantized value of $\varepsilon$ and $\mathrm{INT}[z]$ is the
largest integer $\le z$. Fig. 1 illustrates this quantization
function.

Furthermore, define a dither density function $f(z)$
to be a uniform density function such that

$$f(z) = \begin{cases} 1, & \varepsilon - 0.5 \le z < \varepsilon + 0.5 \\ 0, & \text{elsewhere} \end{cases} \qquad -1 < \varepsilon < 1. \qquad (3)$$

When a random variable z from this density is
applied prior to quantization, it then follows that
the expectation of $Q(\varepsilon)$, $\mu_Q(\varepsilon)$, may be expressed as

$$\mu_Q(\varepsilon) = \int_{-\infty}^{\infty} Q(z)\, f(z)\, dz \qquad (4a)$$

$$= \int_{\varepsilon - 0.5}^{\varepsilon + 0.5} Q(z)\, dz \qquad (4b)$$

$$= \varepsilon \qquad (4c)$$

and the variance of $Q(\varepsilon)$, $\sigma_Q^2(\varepsilon)$, as

$$\sigma_Q^2(\varepsilon) = \int_{-\infty}^{\infty} [Q(z)]^2 f(z)\, dz - \mu_Q^2(\varepsilon) \qquad (5a)$$

$$= \int_{\varepsilon - 0.5}^{\varepsilon + 0.5} [Q(z)]^2\, dz - \varepsilon^2 \qquad (5b)$$

$$= |\varepsilon|\,(q - |\varepsilon|). \qquad (5c)$$

The simplified expressions for $\mu_Q(\varepsilon)$ and $\sigma_Q^2(\varepsilon)$
result from the method of quantization. In particular,
(2) generates equal-size steps symmetric about the
origin as shown in Fig. 1.
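
The unbiasedness and variance results can be checked numerically. The following is a minimal sketch, assuming the unit quantization interval of (2); the function names `quantize` and `dithered_stats`, the trial count, and the seed are illustrative choices, not part of the original paper.

```python
import math
import random


def quantize(x):
    """Equation (2): round to the nearest integer, halves rounded up."""
    return math.floor(x + 0.5)


def dithered_stats(eps, trials=100_000, seed=0):
    """Sample mean and variance of Q(z) for z uniform on [eps-0.5, eps+0.5)."""
    rng = random.Random(seed)
    samples = [quantize(eps + rng.random() - 0.5) for _ in range(trials)]
    mean = sum(samples) / trials
    var = sum((s - mean) ** 2 for s in samples) / trials
    return mean, var


for eps in (-0.7, 0.25, 0.5):
    mean, var = dithered_stats(eps)
    # Compare with the unit-interval prediction |eps| * (1 - |eps|).
    predicted = abs(eps) * (1 - abs(eps))
    print(f"eps={eps:+.2f} mean={mean:+.4f} var={var:.4f} predicted var={predicted:.4f}")
```

The sample mean should track the undithered value and the sample variance should track the predicted value, up to Monte Carlo noise.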


Figure 1. Symmetric equal step quantization.


It is conceivable that a variable might undergo
multiple quantizations prior to its utilization. In such
cases the question arises as to the extent to which
dither should be applied. For example, a variable
could be quantized to quarter-unit precision and then
be rounded via (2) to unit precision. This process
results in the quantization diagram of Fig. 2. It should
be noted that this diagram is equivalent to that of Fig.
1 shifted to the left by 0.125. This shift, or bias, is
caused by the upward rounding of values midway
between quantization intervals: in this case, the
rounding of 0.125 to 0.25, etc. Applying (3) and (4) to
this quantization function results in

$$\mu_Q(\varepsilon) = \varepsilon + 0.125. \qquad (6)$$

The bias in (6) is eliminated by redefining the dither
density function as

$$f(z) = \begin{cases} 1, & \varepsilon - 0.625 \le z < \varepsilon + 0.375 \\ 0, & \text{elsewhere.} \end{cases}$$

This bias can also be eliminated by applying a dither
with density function

$$f(z) = \begin{cases} 4, & \varepsilon - 0.125 \le z < \varepsilon + 0.125 \\ 0, & \text{elsewhere} \end{cases} \qquad -1 < \varepsilon < 1$$

prior to quarter-unit quantization and then applying
a dither with density function

$$f(z) = \begin{cases} 1, & \varepsilon - 0.5 \le z < \varepsilon + 0.5 \\ 0, & \text{elsewhere} \end{cases} \qquad -1 < \varepsilon < 1$$

prior to unit quantization. Either method results in an
unbiased estimation of $\varepsilon$.
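
The 0.125 shift and its removal can be verified by integrating the two-stage quantizer over the dither interval. This is a sketch under stated assumptions: both stages round halves upward as in (2), and the helper names `round_half_up`, `two_stage`, and `expected_output`, the grid size, and the test value 0.3 are hypothetical choices made for illustration.

```python
import math


def round_half_up(x, step=1.0):
    """Round x to the nearest multiple of step, halves rounded up."""
    return step * math.floor(x / step + 0.5)


def two_stage(x):
    """Quarter-unit quantization followed by unit quantization via (2)."""
    return round_half_up(round_half_up(x, 0.25), 1.0)


def expected_output(eps, lo_offset, hi_offset, n=100_000):
    """Midpoint-rule estimate of E[two_stage(z)], z uniform on [eps+lo, eps+hi)."""
    width = hi_offset - lo_offset
    total = sum(
        two_stage(eps + lo_offset + (k + 0.5) * width / n) for k in range(n)
    )
    return total / n


eps = 0.3
plain = expected_output(eps, -0.5, 0.5)        # dither of equation (3)
shifted = expected_output(eps, -0.625, 0.375)  # redefined dither density
print(f"plain dither:   {plain:.4f} (eps + 0.125 = {eps + 0.125:.4f})")
print(f"shifted dither: {shifted:.4f} (eps = {eps:.4f})")
```

With the symmetric dither the expectation sits 0.125 above the true value, while the shifted interval brings it back onto the true value.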

Any unbiased estimator of $\varepsilon$ will have an
associated discrete probability density function. For
$-1 < \varepsilon < 0$ the discrete values will be -1 and 0. For
$0 < \varepsilon < 1$ the values will be 0 and 1. In either case the
variance of this function depends on $\varepsilon$ as given in (5c).
Fig. 3 plots the standard deviation of the density
function for q = 1. This plot, which consists of half-circles
symmetric about the error axis, shows the error
increasing from zero at each quantum level to 0.5
halfway between quantum levels.

The application of dither may be generalized to n
successive quantizations with quantization intervals
$q_1, q_2, \ldots, q_n$. If a dither is applied before each
quantization then the respective dither density
functions are

$$f_i(z) = \begin{cases} 1/q_i, & \varepsilon - \tfrac{1}{2} q_i \le z < \varepsilon + \tfrac{1}{2} q_i \\ 0, & \text{elsewhere} \end{cases}$$

for $i = 1, 2, \ldots, n$. This gives an
unbiased estimator $\mu_Q(\varepsilon)$ with variance given by (5c)
with $q = q_n$. However, a single dither can be used
before the first quantization if the introduced bias can
be removed. For n quantizations with a single dither
from $U[\varepsilon - \tfrac{1}{2} q_n,\ \varepsilon + \tfrac{1}{2} q_n]$, the resulting bias is

$$\tfrac{1}{2}\,(q_1 + q_2 + \cdots + q_{n-1}).$$

To remove this bias the dither density function is
redefined as

$$U\!\left[\varepsilon - \tfrac{1}{2} q_n - \tfrac{1}{2}(q_1 + q_2 + \cdots + q_{n-1}),\ \ \varepsilon + \tfrac{1}{2} q_n - \tfrac{1}{2}(q_1 + q_2 + \cdots + q_{n-1})\right].$$


Figure 3. Quantization error standard deviation.

A dither with this density function is referred to as
consolidated dither. With the bias removed we now
have an unbiased estimator with variance given by
(5c) with $q = q_n$.
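
As a worked illustration of the consolidated dither, the interval endpoints can be computed exactly with rational arithmetic. This is a sketch only: the function name `consolidated_dither_interval` is hypothetical, and it assumes the bias of a single dither is half the sum of all quantization intervals except the last, with the dither width set by the last interval.

```python
from fractions import Fraction


def consolidated_dither_interval(intervals):
    """Return (low, high) dither offsets relative to the value being sent.

    `intervals` lists q_1, ..., q_n in the order the quantizations are applied.
    """
    q = [Fraction(x) for x in intervals]
    bias = sum(q[:-1]) / 2   # (q_1 + ... + q_{n-1}) / 2
    half = q[-1] / 2         # half-width set by the last interval q_n
    return -half - bias, half - bias


# Quarter-unit quantization followed by unit quantization.
low, high = consolidated_dither_interval(["1/4", 1])
print(low, high)  # -5/8 3/8
```

For the quarter-unit then unit case this reproduces the interval from 0.625 below the value to 0.375 above it.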

Depending on the method of rounding used, the
application of dither can introduce different biases
when negative numbers or numbers close to zero are
involved. A discussion of various rounding methods,
introduced biases, and methods for avoiding these
biases is included in an appendix.

III. Application

Suppose we have two computers, A and B, where
A is generating a series of numbers which are then
sent to B. The numbers in the series are floating-point
which, for simplicity, are assumed to be
restricted to the interval [0,1] with 6 significant digits
of precision. Also, for simplicity, it is assumed that
during the time of interest each of the numbers is
equal to a constant. This is not a requirement for the
method to work, but makes it easier to grasp the
concepts involved.

Due to the limited memory and processing speed
of computer B, it cannot handle the series of numbers
in its original floating point format, but can handle
only integers. Therefore, computer A rounds each
number to the nearest integer (0 or 1) before sending
it to B. B accumulates the series of numbers and
reports the running sum after each number is added.
Since all of the numbers in the series are the same and
are therefore rounded to the same integer, either 0 or
1, a constant error is introduced. This produces a bias
which is accumulated by B and results in an error in
the sum which increases with each number added.

This is unacceptable, but if computer B cannot be
upgraded or replaced, then only integers can be sent
from A to B. Fortunately, there is a way to remove
the bias in the accumulated sum, at the expense of
introducing some variance in the value of the sum: we
add a random dither to the numbers before rounding
them. Since integer rounding is used, the dither should
be from $U[\varepsilon - 0.5,\ \varepsilon + 0.5]$, where $\varepsilon$ is the current
number to be rounded.

Fig. 4 shows the actual and true (no rounding)
accumulated sums for a series of one hundred
numbers, each equal to 0.5 and rounded up to 1. Fig.
5 shows the same series with dither applied before
rounding, and Fig. 6 shows the accumulated error. A
marked improvement can be seen when dither is
added.


Figure 4. Accumulated sum of a series of 100 numbers,
each equal to 0.5.


Figure 5. Accumulated sum of a series of 100 numbers,
each equal to 0.5 plus a dither generated from
$U[\varepsilon - 0.5,\ \varepsilon + 0.5]$.

Now we'll complicate the problem by adding
another computer, C, between A and B. Computer C
can handle numbers with two fractional bits of
precision; therefore, A rounds numbers to the nearest
multiple of 0.25 before sending them to C, which in
turn rounds them to integers and sends them to B,
which accumulates them as before. The accumulated
sum will still have a bias that can be removed using
dither. We could apply a $U[\varepsilon - 0.125,\ \varepsilon + 0.125]$ dither
in A before rounding to the nearest multiple of 0.25,
followed by a $U[\varepsilon - 0.5,\ \varepsilon + 0.5]$ dither applied in C
before rounding to the nearest integer (double dither).
However, in Section II it was shown that a single
consolidated dither can be applied instead, with
density function

$$U[\varepsilon - 0.5 - 0.125,\ \varepsilon + 0.5 - 0.125] = U[\varepsilon - 0.625,\ \varepsilon + 0.375].$$

This dither is applied in A before rounding to the
nearest multiple of 0.25. Fig. 7 shows the
accumulated sums for no dither, double dither, and
consolidated dither, along with the true sum. Both the
double dither and consolidated dither remove the bias,
but the consolidated dither requires less computation.
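
The three-computer scenario can be simulated directly. The sketch below is illustrative rather than the authors' program: the function names `round_to`, `send`, and `accumulated_sum`, the strategy labels, and the seed are assumptions, and both roundings are taken to round halves upward.

```python
import math
import random


def round_to(x, step):
    """Round x to the nearest multiple of step, halves rounded up."""
    return step * math.floor(x / step + 0.5)


def send(value, strategy, rng):
    """Pass one number through A (quarter units) and C (integers)."""
    if strategy == "none":
        return round_to(round_to(value, 0.25), 1.0)
    if strategy == "double":
        # Dither in A before quarter-unit rounding, then again in C.
        quarter = round_to(value + rng.uniform(-0.125, 0.125), 0.25)
        return round_to(quarter + rng.uniform(-0.5, 0.5), 1.0)
    if strategy == "consolidated":
        # One shifted dither in A; C rounds without further dither.
        dithered = value + rng.uniform(-0.625, 0.375)
        return round_to(round_to(dithered, 0.25), 1.0)
    raise ValueError(strategy)


def accumulated_sum(value, count, strategy, seed=1):
    """Sum that B would report after receiving `count` copies of `value`."""
    rng = random.Random(seed)
    return sum(send(value, strategy, rng) for _ in range(count))


for strategy in ("none", "double", "consolidated"):
    total = accumulated_sum(0.5, 100, strategy)
    print(f"{strategy:>12}: accumulated {total:.0f}, true sum 50")
```

Note that the consolidated strategy draws one random number per value while the double strategy draws two, which is the computational saving the text refers to.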


Figure 7. Accumulated sums with no dither, double
dither, and consolidated dither applied to a series of
100 numbers, each consisting of 0.5, rounded to the
nearest multiple of 0.25, then rounded to the nearest
integer.


IV. Summary

This paper developed an application of dither to
recover lost precision and reduce biases introduced by
quantization. Both the simple case with one
quantization of a variable and the case of multiple
quantizations were considered. In all cases the
dithering technique results in an unbiased estimate of
an unobservable variable in addition to knowledge of
the variance about the estimate. The methodology is
illustrated in a computer communications application.

Appendix 

Care must be taken when using dither to avoid
introducing a bias when rounding numbers that are
close to or less than zero. Dither requires equal size
quantization intervals for unbiased results. Depending
on the rounding method used, the quantization
intervals may or may not be of equal size.

There are several methods of rounding likely to be
used on computers, each rounding values to the
nearest integer (integer rounding is assumed here for
simplicity). They differ in the way they treat values
with fractional part 0.5. The first method, which we
will call normal rounding, is

$$Q(\varepsilon) = \mathrm{Sgn}(\varepsilon) \cdot \mathrm{INT}[\,|\varepsilon| + 0.5\,].$$
AN ALTERNATE METHODOLOGY FOR SUBJECT DATABASE PLANNING


Henry  D.  Crockett,  Mark  E.  Eakin,  and  Craig  W.  Slinkman 


ABSTRACT 

An important aspect of data administration
is strategic data planning. Strategic data
planning is the scheme which an enterprise
uses to ensure that its information systems
function can support the managerial
objectives of the enterprise. An important
component of strategic data planning is the
determination of the subject databases
needed. James Martin has suggested a simple
ad hoc procedure for performing this
analysis. An alternative procedure is
suggested using SAS to perform a multivariate
statistical technique called correspondence
analysis. This technique has the
advantages that it has a strong theoretical
justification, yields a numeric measure of
the strength of the subject database
clustering, is well understood, and is
relatively simple to include in CASE
software.

INTRODUCTION 

James Martin, in his book Strategic Data-Planning
Methodologies, presents an
organized method by which organizations can
design their data resources to meet their
long term information needs. A corporation
determines what processes it must carry out
in order to thrive in business, then
determines what data might be needed in order
to support these processes. The process
model that Martin develops, called an
Enterprise Model, is quite similar to IBM's
Strategic Business Plan. However, IBM's
plan does not provide the detail and the
organized method needed to make the plan
fully useful. Martin uses his Enterprise
Model to determine the data requirements of
the organization, and to divide these data
requirements into subject databases that
contain all of the information needed about
specific entities in the corporate
environment. By this method he hopes to
eliminate duplicate information and effort
in programming and planning.

Once  the  Enterprise  Model  and  subject 
databases  are  determined  they  must  then  be 
separated  into  operational  subsystems  for 
implementation  on  a  piecemeal  basis. 

Martin's methodology for accomplishing this
seems to be inadequate, and open to multiple
interpretations and errors. This paper will
address these problems and present an
alternate method for determining these
operationalizable subsystems.

The  Enterprise  Model 

The enterprise model is a top down view of
all activities which need to be performed in
order to have a functional organization.
All organization activities can be described
as processes carried out by different
functional areas in the organization. The
functional areas of an organization refer to
the major areas of activities carried out by
the organization, such as finance,
production, sales, distribution, and
accounting. Each functional area can then
be divided into the processes which must be
carried out in order to meet the needs of
the organization. The functional areas and
processes should be those needed to maintain
the existence of the corporation. An
example would be the functional area of
finance, which would need to carry out the
processes: financial planning, budgeting,
capital acquisition, funds management, and
banking. This Enterprise Model when
complete should represent a comprehensive
model of the activities carried out by the
organization. It should also be an understandable
and useful tool in understanding
the operation of the organization as a
whole, and it should remain true as long as
there is no dramatic fundamental change in
the organization's statement of purpose.

Once the functional areas and processes of
an organization have been established, the
data necessary to support them
can be determined by contacting the
department or group which performs each
process. However, the identification of the
functional areas and processes should be
totally independent of the current structure
of the organizational chart.

Subject  Databases 

James Martin has coined the term subject
database to represent the logical view of
all data collected about entities in the
corporate domain. Many information systems
have already incorporated this methodology
informally by grouping all records that are
needed presently for one entity into one
database. The difference between the
classical method that may contain the
correct information and the subject database
method is that the classical view collects
all the data needed by the application that
is presently under construction, while a
subject database would be constructed to
include all information that will be needed
in the foreseeable future. These subject
databases would be created by the
interaction of the data requirements of the
corporation mapped onto the enterprise model
of the corporation. In order to accomplish
this, data classes of information created by
examining entities in the organization's
environment would be cross referenced as to
which data classes are used as input and
output of processes in the Enterprise Model.
By systematically considering data needed by
the organization, instead of the conventional
manner of merely collecting data for each
application as needed, an overview of the
data needs of the organization as a whole
can be obtained. Such an overview could be
very profitable for the corporation in terms
of programming effort and timeliness. This
would mean that whenever a new application is
created the information necessary should
have already been included in the database;
therefore the applications programmer should
not need to create his own data in order to
create a new application. This would
greatly decrease the time and effort
required for new applications. Subject
databases also reduce the number of
databases necessary for operation. However,
most organizations do not deal with a large
number of entities. If files are designed
for specific applications, then the number of
files needed grows almost as fast as the
number of applications. This proliferation
leads to redundant data, update errors, and
poor design of application programs.

Grouping  Subject  Databases  Into  Easily 
Implementable  Subsystems 

In order to implement these subject
databases, Martin suggests that the subject
databases be grouped into implementable
subsystems. Then the subsystem which
satisfies the immediate needs of the
corporation should be implemented first,
hopefully to produce a new application which
is acknowledged to be much needed in the
organization. The approach that James
Martin uses to divide the databases into
implementable subsystems is similar to IBM's
Business Systems Planning methodology. Both
approaches rely on manual methods of
manipulation, and some areas are ill defined
and open to multiple answers depending on
the interpretation of the person organizing the
sequence of subject databases.

First the processes of the organization are
ordered by the life-cycle approach. Most
service and manufacturing organizations tend
to have a four stage life cycle: planning,
acquisition, stewardship, and disposal. The
databases are then entered as columns and
the intersections of processes and databases
are designated by a 'U' if the process uses
the data in that database, or by a 'C' if
the data in that database is created by that
process. In Martin's example this matrix
appeared as Figure #1.

Martin then changes the order of the subject
databases so that the 'C's are ordered from
the top left hand corner to the bottom right
hand corner. However, the order of the
processes is not changed. This reordering
is done manually and is subject to multiple
interpretations. There is no one correct
way in which to order the columns;
different analysts may agree on the processes
and the subject databases and disagree on
the way to order the columns. This disagreement
can have far reaching effects since
this ordering of the matrix is then grouped
into subsystems for implementation purposes.


These groupings of 'C's into implementable
subsystems are done by inspection. The
selection of clustering is judgmental;
however, Martin suggests an affinity
analysis as a follow up step. This
affinity analysis is very rough and does not
provide hard guidelines for the delineation of
borderline cases into separate groupings.
Even given that the groupings are more or
less correct, other problems arise from this
method. Some of the 'U's fall outside of
the groupings and are therefore considered
as exterior data flows from one subsystem to
another. Therefore, even attempting to
implement these subsystems in an orderly
manner may prove very difficult, for the new
subsystems will sometimes have to share data
with old systems. This will produce
incompatibilities and lead to patching of
data flows and more data redundancy in the
system, rather than less. It is precisely
this sort of complication that will lead to
problems in the attempt to organize the
corporate data into a TYPE III environment.

Another  major  problem  can  be  seen  in  the 
implementation  scheme  shown  in  Figure  #2. 

This matrix not only shows the problem of
'U's exterior to the subsystems; there is
also a 'C' which is incompatible with
Martin's arrangement of 'C's on the
diagonal. The processes Budget Planning and
Sales Forecasting both help to create the
subject database Budget, and the processes are
so far removed from each other in the life
cycle that Budget cannot be arranged in any
way that will bring both 'C's onto the
diagonal. Martin disregards this
inconsistency by not even mentioning it
specifically. The only reference made to the
problem of two processes creating one
subject database, and thereby potentially
leading to this problem, is a short statement
to the effect that these types of databases
might be candidates to be split up into two
databases, thereby artificially alleviating
the problem.

Canonical  Correlation  Analysis 

Canonical correlation uses linear compounds
to describe the dependencies between two
sets of variables [Morrison, 1976]. Let $X_1$
denote the first set of variables, in which
there are r variables and N observations.
The second set, denoted by $X_2$, contains c
variables and N observations. In this paper,
N is the number of process by subject
database relations which contain either a
'U' or a 'C'. The first set of variables
consists of N observations of r indicator
variables:

$$X_{1ij} = \begin{cases} 1 & \text{if observation } i \text{ is from row } j \\ 0 & \text{otherwise} \end{cases} \qquad i = 1, 2, \ldots, N; \quad j = 1, 2, \ldots, r$$

and the second set consists of N
observations of c indicator variables:

$$X_{2ij} = \begin{cases} 1 & \text{if observation } i \text{ is from column } j \\ 0 & \text{otherwise} \end{cases} \qquad i = 1, 2, \ldots, N; \quad j = 1, 2, \ldots, c$$

The first step in canonical correlation
combines $X_1$ and $X_2$ into a single matrix $X$ with
N rows and r + c columns and calculates the
sample variance-covariance matrix S where:

$$S = \left[ X'X - \Big(\sum_i x_i\Big)\Big(\sum_i x_i\Big)' \big/ N \right] \Big/ (N - 1)$$

with $x_i$ denoting the i-th observation (row of $X$).
The variance-covariance matrix is then
partitioned into four submatrices:

$$S = \begin{pmatrix} S_{11} & S_{12} \\ S_{12}' & S_{22} \end{pmatrix}$$

where $S_{11}$ and $S_{22}$ are the variance-covariance
matrices of $X_1$ and $X_2$,
respectively, and $S_{12}$ contains the
covariances of the $X_1$ variables with the $X_2$
variables. These variance-covariance
matrices are used to calculate the
descriptive linear compounds.

Canonical correlation describes the
dependencies of $X_1$ and $X_2$ by

$$d_i = a_i' X_1 \quad \text{and} \quad v_i = b_i' X_2, \qquad i = 1, 2, \ldots, m, \quad m = \min(r, c),$$

such that $d_1$ and $v_1$ have the highest
correlation among all pairs of linear
compounds, $d_2$ and $v_2$ have the highest
correlation among all pairs of linear
compounds orthogonal to (or uncorrelated with)
the first two, etc. Each pair of linear
compounds is uncorrelated with all others.

In the situation being studied, only $d_1$ and
$v_1$ need to be considered since the purpose
of this study was to diagonalize the matrix
by rearranging the rows and columns. The
linear combinations that show the strongest
correlation also give the best diagonalization.

The values of $a_1$ and $b_1$ can be found by
solving the two simultaneous equations:

$$(S_{12} S_{22}^{-1} S_{12}' - \lambda_1 S_{11})\, a_1 = 0$$

$$(S_{12}' S_{11}^{-1} S_{12} - \lambda_1 S_{22})\, b_1 = 0$$

where $\lambda_1$ is the largest characteristic root
or eigenvalue of the following equation:

$$\left| S_{12} S_{22}^{-1} S_{12}' - \lambda S_{11} \right| = 0 \quad \text{or} \quad \left| S_{12}' S_{11}^{-1} S_{12} - \lambda S_{22} \right| = 0.$$

Statistical packages are available which
will quickly calculate the vectors $a_i$ and
$b_i$ (i = 1, 2, ..., m). The results in this
paper were obtained with PROC CANCORR of SAS, the
Statistical Analysis System.

After finding $d_1$ and $v_1$, the values of $X_1$
and $X_2$ are substituted into $d_1$ and $v_1$,
respectively, to obtain the canonical scores.
These scores are then ranked from 1 to r for
the $d_1$ values and ranked from 1 to c for the
$v_1$ values; tied scores receive the same
rank. Since there are only r unique values
of $d_1$ and c unique values of $v_1$, these new
ranks establish the new row and column
position of each observation. These new
positions rearrange the rows and columns of
the old matrix and establish a new matrix
showing the strongest possible
diagonalization.
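
The row and column reordering can be sketched without a statistics package. The code below does not reproduce PROC CANCORR; it substitutes reciprocal averaging, a power iteration that, for two sets of indicator variables, is generally understood to converge to the leading pair of canonical scores once the trivial constant solution is removed. The function name, the process and database labels, and the iteration count are hypothetical, and the sketch assumes the matrix has at least two rows and is not split into disconnected blocks.

```python
def reorder_by_canonical_scores(cells, iterations=200):
    """Order rows and columns by their leading scores.

    cells: iterable of (process, database) pairs holding a 'U' or a 'C'.
    """
    cells = list(cells)
    rows = sorted({r for r, _ in cells})
    cols = sorted({c for _, c in cells})
    row_deg = {r: sum(1 for rr, _ in cells if rr == r) for r in rows}
    col_deg = {c: sum(1 for _, cc in cells if cc == c) for c in cols}
    n = len(cells)
    x = {r: float(k) for k, r in enumerate(rows)}  # arbitrary non-constant start
    y = {c: 0.0 for c in cols}
    for _ in range(iterations):
        # Remove the trivial constant solution, then rescale to unit variance
        # (each row weighted by the number of cells it contains).
        mean = sum(row_deg[r] * x[r] for r in rows) / n
        x = {r: x[r] - mean for r in rows}
        scale = (sum(row_deg[r] * x[r] ** 2 for r in rows) / n) ** 0.5
        x = {r: x[r] / scale for r in rows}
        # Column score = average row score over the cells in that column.
        y = {c: 0.0 for c in cols}
        for r, c in cells:
            y[c] += x[r] / col_deg[c]
        # Row score = average column score over the cells in that row.
        x = {r: 0.0 for r in rows}
        for r, c in cells:
            x[r] += y[c] / row_deg[r]
    return sorted(rows, key=x.get), sorted(cols, key=y.get)


# Hypothetical process-by-database cells that form a staircase when ordered.
example_cells = [
    ("plan", "budget"), ("acquire", "budget"), ("acquire", "vendor"),
    ("dispose", "vendor"), ("dispose", "asset"),
]
row_order, col_order = reorder_by_canonical_scores(example_cells)
print(row_order, col_order)
```

The direction of the ordering is arbitrary, so the result may come out reversed along both dimensions; either direction places the marked cells along the diagonal.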


The problem with canonical analysis is the
interpretability of the subsystems that are
grouped together. This procedure will
maximize correlations between sets; however,
it does not provide a facility for
interpreting the resulting dimensions of the
subsystems arranged by correlation. Another
cause for concern might be the sensitivity
of the solution to the inclusion of further
'U's and 'C's into the matrix itself at some
later date. This has the possibility of
radically changing the solution to the
correlations. Therefore, the matrix should
be defined as completely as possible before
the use of this methodology. Other
theoretical limitations include outliers,
multicollinearity and singularity. However,
the method by which the information matrix
is developed seems to negate the ill effects
of these problems.

Application  of  Canonical  Analysis 

Canonical analysis was performed upon
Martin's original matrix using the canonical
correlation procedures in SAS. Canonical
row and column scores were obtained using
PROC CANCORR and the matrix was then arranged
in ascending order along both dimensions.

The resulting matrix appears in Figure #3.

This method appears to provide for a much
more regular appearance than merely
rearranging the columns, and is much less
open to multiple interpretations. If all
parties agree to the processes and subject
database assignments then this method
provides a statistically defensible method
of rearrangement that is reproducible. In
order to group the new arrangement into
operationalizable subsystems the canonical
variates are clustered using a SAS procedure
called PROC CLUSTER. This produces
differing numbers of clusters and the R-square
for each number of clusters. These can then
be plotted on two axes and the number of
clusters determined by observing a bend in
the resulting curve or by deciding how many
clusters are adequate and defensible. If
several possible numbers of clusters meet
all requirements then it may be possible to
determine which set divides the matrix into
the most easily implemented subsets of
systems. Although this does seem to be
rather arbitrary, the procedure PROC CLUSTER
will automatically determine which
observations fall in which sets given a
chosen number of sets.
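
The R-square curve used to pick the number of clusters can be sketched as follows. This is not PROC CLUSTER itself: the merging rule shown is Ward's minimum-variance criterion, chosen here as an assumption because the text does not state which clustering method was used, and the function names and sample points are hypothetical. Each point stands for the pair of canonical scores of one marked cell.

```python
def within_ss(cluster):
    """Sum of squared distances of 2-D points from their centroid."""
    cx = sum(p[0] for p in cluster) / len(cluster)
    cy = sum(p[1] for p in cluster) / len(cluster)
    return sum((p[0] - cx) ** 2 + (p[1] - cy) ** 2 for p in cluster)


def r_square_by_cluster_count(points):
    """Agglomerate points and report R-square for every number of clusters."""
    clusters = [[p] for p in points]
    total = within_ss(list(points))  # assumes the points are not all identical
    result = {len(clusters): 1.0}
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                merged = clusters[i] + clusters[j]
                cost = (within_ss(merged) - within_ss(clusters[i])
                        - within_ss(clusters[j]))
                if best is None or cost < best[0]:
                    best = (cost, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
        explained = 1.0 - sum(within_ss(c) for c in clusters) / total
        result[len(clusters)] = explained
    return result


# Hypothetical (row score, column score) pairs forming two tight groups.
scores = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
for k, r2 in sorted(r_square_by_cluster_count(scores).items()):
    print(k, round(r2, 4))
```

Plotting R-square against the number of clusters and looking for the bend corresponds to the selection step described above; in this toy data the bend falls at two clusters.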

This  procedure  was  performed  on  the  above 
matrix  and  the  optimal  number  of  clusters 
was  determined  to  be  nine.  This  produced 
clusters which contain all of the 'U's and
'C's in the original matrix. Therefore
there is no data being shuffled from one
subsystem to another. This will reduce the
time and effort spent patching and
implementing a new system. The subsystems
are self-contained, with the only
implementation problem being that each
subsystem will not contain all of the 'C's that
create the data that is used in its 'U's.

The  clustering  of  subsystems  produced  by 
PROC  CLUSTER  appears  in  Figure  1/A. 


CONCLUSION 

The  major  advantage  of  this  system  of 
rearrangement  of  processes  and  subject 
databases  and  clustering  of  the  resulting 
arrangement  is  that  it  can  be  fully 
automated.  Martin  always  stresses  that  any 
system  that  is  of  this  size  and  complexity 
should  be  automated  as  much  as  possible  in 
order  that  more  analysis  be  accomplished. 
This methodology could be used in such a way
as to be interactive, so that when an
Enterprise Model is completed the subsystems
could be determined at the press of a
button. This system of determining the
subsystems is also more easily defended, less
judgmental, and more open to reproduction.

SENSITIVITY ANALYSIS OF THE HERFINDAHL-HIRSCHMAN INDEX
James  R.  Knaub,  Jr.,  Energy  Information  Administration 


Introduction:

The Herfindahl-Hirschman Index (HHI),
the sum of the squares of the relative
percent of sales made by each company in
a market, has been used by the U.S. Government
and industry to measure market
concentration. A small HHI value indicates
low concentration. One paper
attributed to the U.S. Department of
Justice (1982) suggests a value of 1000
for this index to delineate between "moderately
concentrated" and "unconcentrated"
markets. A value of 1000 could be
the result of having ten companies, all
with equal sales. However, if two of
these companies were to merge, an HHI
value of 1200 would result. Thus a twenty
percent change in the HHI may be considered
to be substantial in this case,
or it could be considered to be a random
change in a market, not indicative of a
trend. If this index is calculated for
different time periods for the same market,
there is a question as to when one
may say that a substantial change has
taken place. If a small change in a frame
often results in a large change in the
HHI, then a small change in the HHI may
not mean very much. Conversely, if a
large change in a frame often results in
a small change in the HHI, then one could
say a small change in the HHI may be very
important. (Note the similarity to Type I
and Type II errors in classical hypothesis
testing.)
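
The ten-company example above can be checked with a few lines of Python. This is only an illustrative sketch: the function name `hhi` and the unit sales figures are assumptions chosen to reproduce the values 1000 and 1200 quoted in the text, not data from the paper.

```python
def hhi(sales):
    """Herfindahl-Hirschman Index: sum of squared percent market shares."""
    total = sum(sales)
    return sum((100.0 * s / total) ** 2 for s in sales)


# Ten hypothetical companies, all with equal sales.
equal_market = [1.0] * 10
# The same market after two of those companies merge.
merged_market = [2.0] + [1.0] * 8

before = hhi(equal_market)    # ten shares of 10 percent each
after = hhi(merged_market)    # one share of 20 percent, eight of 10 percent
relative_change = (after - before) / before

print(before, after, relative_change)
```

The merger changes only one share, yet the index moves by a fifth, which is the sensitivity the paper sets out to examine.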

Two approaches to the analysis of the
sensitivity of this index are given in
this paper. Both analyses are measured in
terms of the coefficients of variation
(CVs) of simulated resulting HHI values
when starting with a given market and allowing
certain changes to take place in
random fashions.

Description of Approaches:

The purpose of this paper is to discuss
how one might determine whether a
change in the HHI is substantial. Can we
describe random changes in a market which
are not indicative of a trend? In one approach,
a bootstrap-like procedure was
used to determine the variation in HHI
values which could result should there be
a replacement of actual data by data randomly
chosen from the existing frame. The
CVs obtained in this case seem an unfair
judge of the performance of the HHI, however,
since such a variety of possible
sets of hypothetical "companies" contains
perhaps too many scenarios which would
not be considered comparable to the original,
or observed, scenario. "Restricted"
case simulations here are those where only
replications which resulted in a total
sales volume within five percent of the
observed total volume were considered.

The second approach is to let each
company's volume of business vary according
to a given distribution around its
observed volume to see what HHI values
result. This approach may be more meaningful
in that it is more intuitive. The
same total volume restriction was also
employed here.
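
A minimal sketch of how the two approaches might be simulated is given below. It is one reading of the description above, not the author's program. The seven-company frame, the seed, the number of replications, and all function names are hypothetical. The bootstrap-like step is assumed to resample every company's volume, with replacement, from the observed frame, and the second approach is assumed to use a normal distribution, as the table notes indicate.

```python
import random
import statistics
from functools import partial


def hhi(sales):
    """Sum of squared percent market shares."""
    total = sum(sales)
    return sum((100.0 * s / total) ** 2 for s in sales)


def bootstrap_like(observed, rng):
    """Replace every volume by one drawn at random from the frame."""
    return [rng.choice(observed) for _ in observed]


def individual_replacement(observed, rng, cv_percent):
    """Draw each volume from a normal centered on its observed value.

    For the small CVs used here a negative draw is practically impossible,
    so no truncation is applied (an assumption of this sketch).
    """
    return [rng.gauss(v, v * cv_percent / 100.0) for v in observed]


def simulate(observed, draw, replications, restricted, rng,
             tolerance=0.05, max_attempts=1_000_000):
    """Return mean, median and CV (percent) of the simulated HHI values."""
    observed_total = sum(observed)
    values = []
    for _ in range(max_attempts):
        sales = draw(observed, rng)
        off_by = abs(sum(sales) - observed_total)
        if restricted and off_by > tolerance * observed_total:
            continue  # "restricted" case: discard this replication
        values.append(hhi(sales))
        if len(values) == replications:
            break
    else:
        raise RuntimeError("too few acceptable replications")
    mean = statistics.mean(values)
    return {
        "mean": mean,
        "median": statistics.median(values),
        "cv": 100.0 * statistics.stdev(values) / mean,
    }


# Hypothetical frame of seven companies (volumes sum to 100).
observed = [40.0, 25.0, 15.0, 10.0, 5.0, 3.0, 2.0]
attained = hhi(observed)
rng = random.Random(1988)
ir5 = partial(individual_replacement, cv_percent=5)

results = {
    "B/U": simulate(observed, bootstrap_like, 500, False, rng),
    "B/R": simulate(observed, bootstrap_like, 500, True, rng),
    "IR5/U": simulate(observed, ir5, 500, False, rng),
    "IR5/R": simulate(observed, ir5, 500, True, rng),
}
for label, summary in results.items():
    print(label, summary)
```

Rerunning with replications raised by an order of magnitude is a simple way to see whether the summaries have settled, in the spirit of the experiment described in the next paragraph.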

Although there is literature to consider
to determine the number of replications
needed, here it was very simple to
experiment with numbers of replications
differing by one or more orders of magnitude
to see what practical changes occur
in the results of interest. (Note that
this is similar to what was done in Knaub
(1985b), page 457, although a modification
of the procedure found in Knaub
(1985a), such as that illustrated in
Knaub (1986), could be used.)

Conclusions:

From the table on the next page, it
may be concluded that one should be wary
of

1) forecasts of change based upon
trend analyses supported only by a
change in the HHI of five percent
or smaller, and

2) forecasts of constancy based upon
trend analyses where the HHI has
changed by fifteen percent or more.

Addendum:

Suppose a national market were to be
considered by State. Small changes in
HHI may indicate a trend if enough of
the State markets had HHI changes in the
same direction. Confidence intervals or
hypothesis testing, considering both
types of error, could be used to determine
whether the trend was substantial.

Tabular Examples of Simulation Results*

                             A      B      C      D      E      F

Att. HHI                  4021   2193    605    434    374    324
NC                          52     81    106    272    392     87

B
  U   MEAN HHI            3150      -    599      -    359    319
      MED. HHI            2887      -    58?      -    360    314
      HHI CV                44      -     17      -     20     14
  R   MEAN HHI            3968      -    601      -    368    320
      MED. HHI            4018      -    597      -    367    318
      HHI CV                11      -     13      -     18     14

IR COMPANY CVs=5
  U   MEAN HHI            4021   2193    607    435    375    325
      MED. HHI            4021   2195    607    435    375    324
      HHI CV               3.3    4.6    1.7    2.4    2.7      1
  R   MEAN HHI            4021   2193    607    435    375    325
      MED. HHI            4020   2194    607    435    375    324
      HHI CV               3.2    4.5    1.7    2.4    2.7      1

IR COMPANY CVs=10
  U   MEAN HHI            4020   2194    611    438    377    327
      MED. HHI            4022   2199    611    438    377    326
      HHI CV               6.6    9.2    3.5    4.9    5.4      3
  R   MEAN HHI            4027   2195    611    438    377    327
      MED. HHI            4019   2192    611    438    377    326
      HHI CV               5.8    7.7    3.4    4.8    5.4      3

* "A," "B," "C," "D," and "E" represent retail motor gasoline in five States
"F" represents residential distillate in one State
"-" indicates data not collected
"R" represents total volume restricted to + or - 5% of the observed total volume
"U" represents "unrestricted" volume
"B" denotes a "bootstrap-like" simulation
"IR Company CVs=x" denotes the second simulation approach, where individual replacement
of each company's sales volume occurs using a normal distribution with mean equal to
the observed volume and CV=x
"Att. HHI" is the attained HHI
"NC" is the number of companies in the frame
"MEAN HHI" is the mean of the simulated HHIs
"MED. HHI" is the median of the simulated HHIs
"HHI CV" is the CV of the simulated HHIs

Encoding and Processing of Chinese Language: A Statistical Structural Approach

Chaiho  C.  Wang* 

U.  S.  Department  of  Justice  and  The  George  Washington  University 


Introduction 

The  performance  of  the  encoding  and 
transmission  of  language  information  may 
be  measured  by  the  following  criteria: 

1.  Preserving cultural identity,

2.  Maximizing  processing  speed, 

3.  Maximizing  transmission  accuracy, 
and  minimizing  ambiguity, 

4.  Minimizing  human  effort, 

5.  Minimizing  storage  requirement. 

On the one hand, the Chinese language
has had the time to grow deep cultural
roots unsurpassed by any other in
existence. On the other hand, it was
developed by, and served, the
relatively few educated scholars. For
centuries, the practicality of teaching
it to the masses was not a priority.
Today, when rapid data transmission has
become so important in life, the
structure of the Chinese language
presents information processors with a great
challenge.

This paper proposes mathematical
models in which a balanced approach among
the five criteria is considered. Based
on the statistical structure of the
Chinese language, this procedure
combines a user-friendly input coding
scheme with a low-redundancy internal
coding method for compressed storage.

Both the graphical and pinyin input
options are considered, and special
attention is paid to reducing human effort
at the data entry stage. With the new
technology of tomorrow in mind, the goal
of efficiently computerizing the Chinese
language may be within reach.

Basic  Encoding  Methods 

Let L = (V, B) be a language, where V
is a vocabulary and B is the set of
basic symbols (messages). The language
may be composed of elements of V, which
are strings of basic symbols of B. Let C
be a set of codes which may be used to
represent elements of B, and D be a set
of internal codes which represent the
elements of C within a computer. The set
D may be machine dependent, and need not
be of concern to the user. In a Shannon
[10] like theory, the average message
length (entropy) is defined by

$$E = -\sum_{i=1}^{m} p_i \log p_i \tag{1}$$

where $p_i$ is the probability of occurrence
of the ith message, and m is the number
of elements in B. Since binary coding
digits are customarily used, the symbol
'log' stands for the logarithm of base 2,
and the unit length is called bit per
message, or bit per character (BPC). If
a code c of the set C represents more
than one, say t, messages from B, then c
is said to be of multiplicity t.
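
As a small illustration of definition (1), the sketch below computes the entropy of a probability vector in bits. The function name `entropy_bits` and the two example distributions are assumptions made for illustration. The usual sign convention is used, so that the result is non-negative.

```python
import math


def entropy_bits(probabilities):
    """Average message length in bits: -sum(p * log2(p)) over messages."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


# 26 equally likely messages: the entropy equals log2(26).
uniform_26 = [1 / 26] * 26
e_uniform = entropy_bits(uniform_26)

# A skewed three-message source needs fewer bits than log2(3).
e_skewed = entropy_bits([0.5, 0.25, 0.25])

print(e_uniform, e_skewed)
```

A non-uniform distribution always yields a value at or below the logarithm of the number of messages, which is what frequency-based coding exploits.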

As  an  example,  the  basic  elements  B  of 
the  English  language  may  be  taken  as  the 
set  of  26  characters  (plus  a  few  special 
characters  necessary  to  form  a  grammar), 
while  V  may  be  a  set  of  words,  or  a  set 
of  fixed-length  character  strings.  The 
usage  of  these  characters,  such  as  in 
frequency  distributions,  redundancy,  and 
storage  requirements,  can  be  readily 
studied  (see  [1,  6,  9,  11,  12,  14,  15]). 
Hereafter, we let "English" stand in
general for any alphabetized Western
language.

The human effort of encoding English
text normally involves a one-step
phonetic process: whether the typist is
listening to a dictated message, or
reading from a document, the sound of a
word, say "teacher", is translated into
the correct spelling, t-e-a-c-h-e-r,
which is then entered, letter by letter,
on a keyboard.

In a straightforward coding of the
English language, assuming that all
characters occur with equal frequency,
E = log 26, which is about 5 bits per
character. On the one hand, a standard
storage cell is either 6 bits or 8 bits
in size for a standard mainframe
computer, which provides for use of both
upper  and  lower  case  letters,  and 
manipulation  of  information  beyond 
English  text.  On  the  other  hand,  the  use 
of  compressed  storage  algorithms  (see  [2, 
3,  6,  9,  11,  12]),  which  use  numerical 
coding,  language  elements  coding,  or 
probabilistic  coding,  allows  the 
reduction  of  E  to  below  five  bits.  For 
example,  the  Huffman  [6]  minimum 
redundancy  variable-length  codes  take 
advantage  of  the  frequency  distribution 
of  the  occurrences  of  elements  of  B, 
reducing  E  to  4.2.  An  alternative 
method,  which  utilizes  fixed-length  codes 
while  splitting  B  into  groups  (see  [12]), 
can  achieve  a  similar  result.  Here  the 
following  formula  is  used  to  compute 
entropy: 

$$E = \sum_{i=1}^{G} p_i (\log n_i + \log G) \tag{2}$$

where $p_i$ is the proportion of usage for
the ith group ($\sum p_i = 1$) and $n_i$ is the
number of elements in the ith group. Here within
each of the G groups a fixed length code
is used, and a log G-bit flag is used to
identify the groups.

For modeling Chinese, as a first
attempt we consider a straight coding
method, in which B is the set of all
Chinese characters, and C contains enough
codes to represent elements of B
one-to-one. Depending on the
application, B can have several thousand
to tens of thousands of elements.
Although a keyboard containing the entire
set B can be made available, the time
required to search the keys makes data
entry prohibitively difficult.

In order to locate a given character,
the set B is structured into groups,
according to shapes of "radicals" (部首),
and the elements within a group are
sorted according to the number of strokes in
a radical or a character. This is the
standard  dictionary  lookup  method,  which 
was  developed  years  ago  without 
automation  in  mind.  Although  a  computer 
method  can  be  devised  to  imitate  the 
dictionary  lookup,  it  is  not  practical. 
This  model  can  be  considered  as  a  strict 
cultural  approach,  where  the  meanings  and 
shapes  of  the  radicals  and  characters  are 
recognized. 

The  greater  the  size  of  B,  the  more 
cumbersome  the  data  entry  process,  but  a 
large  entropy  is  not  necessarily  a 
consequence.  Comparing  the  storage 
requirements  of  a  document  written  in 
Chinese  with  its  English  translation 
shows  that,  generally,  less  storage  space 
is  needed  for  the  Chinese  version  than 
for its English translation. Assume that
a six bit unit is used to encode an
English letter, allowing space for both
upper and lower case letters, and a 14
bit unit is used to encode a Chinese
character. A simple experiment indicates
that for coding newspaper information,
the ratio of storage space for the
Chinese text to its English translation
is two to three. For coding classical
Chinese (wenyen 文言), the ratio is one to
four. (Source documents for the
experiment  are:  1.  China  Reconstructs 
Magazine,  published  both  in  Chinese  and 
in  English,  in  Beijing.  2.  Yen,  L. 
(1976),  A  reconstructed  Lao-Tze  with 
English  translation,  Cheng  Wen  Publishing 
Co.,  Taipei.) 

To  facilitate  dictionary  lookup,  in  a 
second  attempt,  numerical  codes  were 
developed  to  represent  the  strokes, 
radicals,  and  the  shape  of  a  character. 
This  approach  goes  back  several  decades. 
The four corner coding method is a good
example. Several such methods have
recently  been  developed  on  a  computer 
(see,  for  example,  [4,  5]).  In  this 
case,  the  set  C  contains  "structural" 
codes,  representing  the  composition  of 
strokes  in  a  character. 

During data entry, a typist observes a
character, codes it according to a set of
rules, and enters the numerical code on
the keyboard. The method requires a
considerable amount of human effort: the
coder must memorize the codes for the
radicals and the encoding rules. In
addition, large work space and storage
space may be required. For example, with
the three-corner method [5], three
two-digit fields are required; that is,
999,999 positions are required to
represent, say, 10,000 characters.
Moreover, unless the numerical code is
unique, an additional auxiliary code must
be used to eliminate multiplicity.

If a one-to-one correspondence is
established between the set of codes C
and B, then the coding is unambiguous,
and decoding becomes possible. If
several elements of B share a common
code, special measures must be taken to
assure accuracy and decodability of
messages. We shall call the former a
one-to-one method, the latter a
one-to-many method. Neither the
four-corner nor the three-corner method
is one-to-one. To make up for the
deficiency, an auxiliary code is needed
to represent all messages that belong to
the same code.

As a third attempt, we rely on the
pinyin method, the phonic approach
common to the majority of Western
languages. Here B is the Latin alphabet.
This gives rise to a two-step data entry
process: the typist observes the
character, say "  ", translates it into
its phonetic representation, "shi", then
enters it on the keyboard. Since "shi"
also represents many other words
(one-to-many), a secondary code is
required to single out "  ".

As will be discussed in the next
section, the pinyin method has several
attractive features. Comparing pinyin
usage of the alphabet with English, there
are several dissimilarities: (1) In
English, the alphabet symbols are codes
as well as language elements. What you
see is what you code. In pinyin, the
pinyin symbols are codes, not language
elements. A coder must first translate a
character into its pinyin representation,
then enter the code on the keyboard. (2)
Non-standardization of pronunciation of
characters gives rise to inaccuracy
problems. (3) There may be many
characters with the same pronunciation,
making decoding difficult.
(4) Each character may have four or five
tones, which will contribute to
inaccuracy and identification problems.

Statistical  structure 

In a straight coding of English by
singleton characters, E = log 26. After
the letters are arranged according to
their frequency of occurrence, in
descending order, and the cumulative
frequency distribution is tabulated, they
can be split into two or more groups.
For example, among the 26 letters, the
eight most commonly used letters account
for 60 percent of the usage. We have

E = .6(log 8 + 1) + .4(log 18 + 1) = 4.46

which is greater than the Huffman
entropy, but less than log 26 = 4.7.
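
The two-group figure above can be reproduced with a short sketch of formula (2). The sketch assumes, as the worked example does, that the term inside the parentheses is the logarithm of the group size. The function name `split_entropy` is hypothetical.

```python
import math


def split_entropy(group_sizes, group_usage):
    """Fixed-length code within each group plus a log2(G)-bit group flag."""
    if len(group_sizes) != len(group_usage):
        raise ValueError("one usage proportion is needed per group")
    if abs(sum(group_usage) - 1.0) > 1e-9:
        raise ValueError("usage proportions must sum to one")
    flag_bits = math.log2(len(group_sizes))
    return sum(p * (math.log2(n) + flag_bits)
               for n, p in zip(group_sizes, group_usage))


# Eight common letters carry 60 percent of usage; the other 18 carry 40.
english_split = split_entropy([8, 18], [0.6, 0.4])
english_straight = math.log2(26)

print(round(english_split, 3), round(english_straight, 3))
```

The split value comes out just under 4.47 bits per character, against about 4.70 for straight coding, in line with the figures quoted in the text.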

For a single-character splitting of
Chinese characters, let m be the number
of characters, and assume that the first
x most frequently used characters account
for a proportion p of the usage. Then a split
between the two groups of x characters
and the remaining m - x characters
yields the following entropy:

E = p(log x + 1) + (1 - p)(log(m - x) + 1)

Assuming that m is sufficiently larger
than x, the splitting would keep E
unchanged if

p(log x + 1) + (1 - p)(log m + 1) = log m

It follows that

p = 1/(log m - log x).

If m = 8000 and x = 500, then p = 1/4.
This says that if 500 common characters
account for at least 25 percent of the
usage, then numerically a splitting will
decrease entropy.
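
The intermediate algebra behind the break-even proportion is short and may be written out as follows. It uses only the symbols p, x, and m defined above and the approximation log(m - x) ≈ log m.

```latex
\begin{align*}
p(\log x + 1) + (1 - p)(\log m + 1) &= \log m \\
p\log x + (1 - p)\log m + 1 &= \log m \\
1 &= p\,(\log m - \log x) \\
p &= \frac{1}{\log m - \log x} = \frac{1}{\log (m/x)}
\end{align*}
% With m = 8000 and x = 500: m/x = 16, \log_2 16 = 4, so p = 1/4.
```

For any usage proportion above this value the left-hand side falls below log m, so the split lowers the entropy.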

In an early book by Zipf [15], the
frequency distribution of the usage of
Chinese characters is tabulated (see
Table 1). This table is based
on a sample of 20,000 syllables of speech
in Beijing dialect, which are arranged
according to their frequencies of
occurrence. Reading Table 1 from the
left, the first column gives the number
of occurrences, the second column
presents the number of words assigned to
a given frequency of occurrence, and the
other columns indicate the number of
words in each frequency group having one
or more syllables.

Applying  Huffman  minimum-redundancy 
coding  method  to  Table  1  data,  E  can  be 
computed  as  9.654  bits  per  word.  Since 
there  are  13,118  words  among  20,000 
syllables,  the  Huffman  entropy  per 
syllable  would  be  lower  than  9.654.  In 
Zipf's  work,  the  relationship  between  a 
syllable  and  a  character  was  not  clearly 
defined.  Therefore,  the  actual 
(character)  entropy  cannot  be  deduced. 

From this table, however, we can
determine that approximately five
percent of the syllables account for
fifty percent of the usage, ten percent
of the syllables account for sixty
percent of the usage, and forty percent
of the syllables account for eighty-five
percent of the usage. For these three
profiles, using formula (2), we find E
equal to 10.6, 10.4, and 10.8,
respectively.


Table 1. Chinese of Beijing
(Adapted from G. Zipf: The Psycho-Biology of Language)
[Columns: Number of Occurrences; Number of Words; Number of Words by number of syllables. Table body not recoverable.]


In today's standard classification
[8], there are two classes: class one of
3,755 common characters, and class two of
3,008 uncommon characters. If the 6,763
characters are treated equally, the
entropy is 12.7. If, similar to Zipf's
findings, five percent, or 338 characters,
account for 50 percent of the usage, then

E = (log 338 + log 6425)/2 + 1 = 11.5,

which represents more than a fifty
percent storage reduction over the direct
method. If the vocabulary is to be
expanded to cover uncommon Chinese
characters, in the range of, say, 30,000
to 50,000 characters, the splitting
method will be even more useful.

Frequency  tables  such  as  Table  1  can 
be  created  based  on  the  domain  of 
information  to  be  processed.  For 
example,  a  vocabulary  in  chemistry  would 
be  different  from  that  for  composing 
children's  books.  Clearly,  storage 
compression  can  be  made  more  effective  if 
the  correct  vocabulary  is  defined. 

Next,  we  consider  the  structure  of  the 
graphical  Chinese  symbols.  According  to 
the  standard  classification,  there  are 
167  radicals.  Conveniently,  all 
characters  can  be  divided  into  167  groups 
using  the  leading  radical  as  an  index. 

For  the  moment,  let  us  call  the  leading 
radical  and  the  remaining  parts  of  the 
character  the  "head"  and  the  "body"  of  a 
character,  respectively.  For  the  purpose 
of  linguistic  studies,  radicals  are  being 
"isolated"  according  to  their  meanings 
and  historical  background.  The  more 
accurate  the  classification,  the  larger 
becomes  the  collection.  For  reducing 
human  effort  in  data  processing,  however, 
the  set  of  codes  for  radicals  should  be 
reduced.  Because  the  Chinese  vocabulary 
has  been  carefully  developed  and  refined 
over  the  centuries,  two  disjoint  radical 
groups  can  be  merged  with  little  risk  of 
ambiguity. For example, the groups with
leading radicals "wen" (文) and "fanwen"
(攵) have 12 and 32 characters,
respectively; but there is no overlapping
among  the  bodies  of  the  total  44 
characters.  That  is,  if  we  use  the  same 
leading  code  for  wen  and  fanwen,  a 
one-to-one  correspondence  between  the  44 
characters  and  their  designated  codes  is 
preserved;  and  a  computer  will  have  no 
trouble  distinguishing  between  them. 

Suppose we throw in another group with
a leading radical similar in shape to
wen, say the "tongzhitou" (  ); we add
another ten characters to the list. At
this point, we finally encounter one
overlap of a character body (  ). When
this occurs, we can simply create a
secondary code to differentiate between
(  ) and (  ).

If  we  can  regroup  and  reduce  the  set 
of  codes  for  radicals  to,  say,  a  total  of 
64,  a  pleasant  keyboard  containing  these 
radicals  can  be  developed,  and  a  data 
entry  method  based  on  graphical  coding 
becomes  feasible. 


The pinyin method adopts the 26 Latin
letters as the basic symbols. At first
glance, the basic analysis for other
Western languages will be applicable to
pinyin. A closer look reveals two
distinctions. First, among the 26
letters, 26^2 digrams, 26^3 trigrams, ...,
only 407 pinyin units, from one to six
letters, were actually used! With a few
exceptions, a pinyin unit is formed by
adjoining a leading code and a terminal
code (similar to the use of consonants and
vowels in English). Since there are 23
leading elements and 33 terminal
elements, the maximum possible number of
pinyin units is 23x33 = 759.

Second, each pinyin unit may represent
between one and 115 distinct characters.
If an auxiliary code is used to handle
the multiple representation we need
407x115 = 46,805 codes (E = 15.5). In order
to reduce the code size, the 407 pinyin
units may be arranged according to the
multiplicity of each unit, and a variable
size auxiliary code is used to
differentiate the elements within a
group.

A mathematical model

Based on the statistical properties of
the Chinese language, we propose a model
for encoding, storage, and transmission
of Chinese language data. This model
includes (1) an internal "minimal"
redundancy code that stores information
efficiently (not necessarily an absolute
minimum, but rather a minimum with
respect to a particular application);
(2) a user-friendly input method which
"minimizes" the human effort (the input
method can be based on either the graphic
or the pinyin approach); and (3) an
automated dictionary "lookup" method
which links these two devices together.


The Internal Code


First, a vocabulary of characters,
such as the standard collection of the
6763 characters, is defined. Next, an
experiment is conducted to determine the
frequencies of occurrence of the
characters. Based on the statistics, a
user preferred data compression method,
which facilitates the sorting, merging
and other data processing tasks, is
developed. Common phrases can sometimes
be coded as single messages to further
reduce storage. If the 6763 characters
were treated equally, the entropy would be
12.7. This may serve as a guide for
judging the performance of the new coding
scheme. As we have seen earlier, the
Zipf data, in which five percent, or 338
messages, account for 50 percent of the
usage, yields an entropy of 11.5.


The Input Mode


The  input  can  be  either  in  a  graphical 
coding  mode,  a  pinyin  mode,  or  a  mixed 
mode,  such  as  the  method  adopted  by  the 
Chinese National Bureau of Standards for
classifying the Chinese characters (see
[8]).

The  Graphical  Method.  First,  the 
number  of  radical  groups  must  be  reduced, 
as  described  in  the  previous  section.  A 
collection  of  no  more  than  64  basic 
radicals  would  be  suitable  for  quick 
recognition  on  a  keyboard.  Next,  as  the 
encoding  begins,  a  given  character  is 
decomposed  into  a  number  of  basic 
radicals,  which  are  linked  together  by 
following  the  natural  sequence  of  drawing 
the strokes of the character. The code
is a linked list of radical codes (RC) and
direction codes (DC):

[RC1 | DC1] -> [RC2 | DC2] -> ... -> [RCn | 0]

The direction code can be simply
defined as a 2-bit pointer, say, "1" for
a top-down movement, "2" for a
left-to-right movement, and "0" as a nil
pointer indicating the end of the
character code.

For example, the character (  ) will
have the structure

[structure diagram not recoverable]

and the code RC(  )-1-RC(  )-2-RC(  )-0

Here the code consists of three 6-bit
radical codes and three 2-bit direction
codes, for a total of 24 bits. Although
some of the characters can have quite a
long code, the average length per
character will be around 22 bits. After
the complete set of these variable-length
codes is sorted, a look-up table for
decoding can be set up.
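
The linked list of radical and direction codes described above might be packed into bits as sketched below. The radical indices 5, 17, and 42 are hypothetical placeholders, since the paper does not list its 64-radical table. The function names, and the choice to pack each 6-bit radical code and 2-bit direction code into one 8-bit unit, are likewise assumptions of this sketch.

```python
TOP_DOWN = 1       # direction code "1"
LEFT_TO_RIGHT = 2  # direction code "2"
END = 0            # nil pointer closing the character code


def encode_character(radical_codes, directions):
    """Pack [RC | DC] units, 6 + 2 bits each, into a single integer."""
    if len(directions) != len(radical_codes) - 1:
        raise ValueError("one direction is needed between adjacent radicals")
    bits, length = 0, 0
    for rc, dc in zip(radical_codes, list(directions) + [END]):
        if not 0 <= rc < 64:
            raise ValueError("radical code must fit in 6 bits")
        if dc not in (TOP_DOWN, LEFT_TO_RIGHT, END):
            raise ValueError("unknown direction code")
        bits = (bits << 8) | (rc << 2) | dc
        length += 8
    return bits, length


def decode_character(bits, length):
    """Recover the list of (radical code, direction code) pairs."""
    pairs = []
    for shift in range(length - 8, -1, -8):
        unit = (bits >> shift) & 0xFF
        pairs.append((unit >> 2, unit & 0b11))
    return pairs


# A three-radical character: first a top-down step, then left-to-right.
code_bits, code_length = encode_character([5, 17, 42],
                                          [TOP_DOWN, LEFT_TO_RIGHT])
print(code_length, decode_character(code_bits, code_length))
```

A character built from three radicals occupies 24 bits under this layout, matching the count given in the text.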

If  fixed  length  codes  are  desired,  a 
numerical  code  similar  to  the  four 
corner,  or  three  corner  method  can  be 
developed.  A  typical  code,  based  on 
regrouping  of  radicals,  will  look  like 

[PC] [RC1] [RC2] [MC]

where PC is a shape code describing how
the character is partitioned into
radicals, RC1 and RC2 are the two radical
codes,  and  MC  is  an  auxiliary 
multiplicity  correcting  code  pinpointing 
the  given  character  within  the  radical 
group.  The  ordinary  three  corner  method 
requires  a  20-bit  (fixed  length)  code. 
With  an  auxiliary  code,  the  total  length 
would  be  raised  to  about  25  bits.  The 
basic  four  corner  code  is  only  14  to  15 
bits,  but  its  auxiliary  code  would  be 
quite  large. 


The Phonic Method. The
popularity of the pinyin method depends
on the extent to which the pronunciation of
spoken Chinese is unified. For users
who have acquired the pronunciation
skills, this method is very promising.
Standard  pinyin  keyboards  have  been 
developed  and  improved  (see  [7,  13]).  A 
typical  pinyin  code  would  have  the  form 


LC  TC  MC 


where  LC  and  TC  denote  the  leading  and 
terminal  pinyin  codes,  respectively.  MC 
is  the  auxiliary  code  required  to 
eliminate  multiplicity. 

In the worst case, a full pinyin code
would have 23x34x115 = 89,930 messages,
which translates to 16.5 BPC. Since
there are only some 407 active pinyin
units, the total number of messages is
reduced to 407x115 = 46,805, or 15.5 BPC.
But the auxiliary code can be reduced
too. Since the degree of multiplicity
varies among pinyin units, the auxiliary
code can be made variable in length.
When the basic pinyin unit codes (LC-TC)
are sorted according to the values of the
auxiliary code, the units with a large
auxiliary lookup table are separated from
the rest. Only when these units are
called for need one allocate maximum
space for processing the auxiliary lookup
table.

If  a  fixed  length  auxiliary  code  is 
desired,  the  pinyin  unit  which  has  a 
large  multiplicity  can  be  divided  into 
two  or  more  records  in  the  following 
form: 


[LC TC MC1] - [LC TC MC2] - . . .

Since among the 407 pinyin units only
50 have a multiplicity of 32 or higher, a
5-bit auxiliary code would be sufficient.
The unit with the greatest multiplicity
(115) will be divided into four records.
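
The arithmetic in this section can be checked with a short sketch. It assumes, as the text does, that the bits per character (BPC) of a fixed-length code is the base-2 logarithm of the number of messages; the function names are illustrative only, and the counts are taken exactly as printed above.

```python
import math

def bits_per_character(messages: int) -> float:
    """Bits needed to index `messages` distinct codes (log base 2)."""
    return math.log2(messages)

def records_needed(multiplicity: int, aux_bits: int) -> int:
    """Records needed when each record addresses 2**aux_bits characters."""
    return math.ceil(multiplicity / 2 ** aux_bits)

full = 23 * 34 * 115      # worst-case message count quoted in the text
reduced = 403 * 115       # reduced message count as printed in the text

print(full, round(bits_per_character(full), 1))
print(reduced, round(bits_per_character(reduced), 1))
print(records_needed(115, 5))   # records for the largest pinyin unit
```

With a 5-bit auxiliary code each record addresses 32 characters, so a unit of multiplicity 115 needs four records, in agreement with the text.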

To  take  full  advantage  of  the  fact 
that  there  are  only  407  active  pinyin 
codes,  a  voice  activated  procedure  can  be 
developed  for  data  input.  After  the 
computer is programmed to recognize these
407 sound patterns, a coder, while inputting
a data set, may sound out the characters
one by one. Each time, the computer
would identify the pinyin unit and prompt
the coder with a screenful of
characters belonging to the pinyin unit.
To maximize efficiency, for each pinyin
(group) code, the corresponding auxiliary
set of characters is presorted according
to frequency of usage. After the
coder points to the designated character,
it is automatically coded.
Summary


At first glance, the Chinese language
is too complicated for automated
processing. A closer look at its
statistical structure, however, reveals
many "built in" features suitable for
efficient encoding and processing. Much
experimentation will be needed to shed
light on the statistical structure of the
usage of Chinese characters. The
encoding procedures proposed in this
paper may sound too good to be true, but
the technology is available, the theory
is simple, and the potential is
promising; therefore, further research in
this direction is warranted.

*  The  views  expressed  in  this  paper  do 
not  necessarily  reflect  those  of  the 
Department  of  Justice. 
ABSTRACT

Many hormones are secreted into the blood in a
pulsatile manner; i.e., in high concentrations at
'random' times. To study hormone production,
investigators assay its level in the blood at
regularly spaced intervals. The statistical
problem is to differentiate between changes in
the level of the hormone and observations
influenced by a 'random' pulse ('noise'). Two
algorithms are described. The first uses
regression-like statistics computed after deleting
the most 'extreme' observation, combined with a
moving variable-length window, to identify rises
and declines in hormone level. The deletion of the
most 'extreme' observation and the use of a
variable-length window facilitate the exclusion
of 'noisy' values from the determination of the
stage of the hormone. The second algorithm uses
a least-squares criterion to cluster adjacent
points after the elimination of 'noisy' values. A
test statistic for termination of the clustering
is described.

Keywords:  cluster  analysis,  pattern  analysis, 
regression,  biological  rhythms 

INTRODUCTION 

It is believed that hormones regulate many
different time-dependent processes such as
fertility cycles (Bittman et al 1983a,b; Farner
and Follett, 1979; Robinson and Follett, 1982).
Some rhythms are annual, others are monthly, and
yet others are daily. Many investigators are
studying the manner by which this time-dependence
is regulated and how the time-dependence can be
disrupted (Nett and Niswender, 1982; Robinson and
Karsch, 1986). For example, many annual rhythms
are synchronized by the number of hours of
daylight: maintaining animals in a light-controlled
environment and modifying the length of the light
period is used to study the effect of disrupting
this stimulus. Similarly, many daily rhythms are
affected by the time of day and are synchronized
by the light/dark cycle.

Although it is anticipated that hormone
production is regulated in some manner (by responding
to a stimulus which may be another hormone), it
is not unusual for some amount of the hormone to
be present in the blood stream even when
production of the hormone is in a reduced state. It is
now recognized that some hormones are released in
a pulsatile manner from the gland in which they
are produced. That is, a high concentration of
hormone (a pulse) is released into the blood over
a relatively short interval. The hormone is then
extracted from the blood as it passes through an
organ, such as the liver, or mixes rapidly with
the blood during the next few passes through the
circulatory system.

Experiments to understand the time pattern of
hormone production involve repeated sampling of
the blood; the samples are then assayed for the
amount of hormone. The number of samples is
limited by the cost or by the amount of blood
that can be removed without causing damage to the
subject (human or animal). Usually, continuous
sampling is not feasible. The number of samples
is determined by the length of time over which
samples are required; the greater the length of
time, the greater the spacing between samples.
When studying hormone levels during an 8 hour
period, it may be possible to take 2 to 3 samples
per hour; if the period is only 4 hours, 5 to 6
samples per hour may be taken. In contrast, when
studying an annual rhythm, only two to three
samples per week may be possible.

Often the samples are assayed by
radioimmunoassay methods to estimate the amount of hormone
(Niswender et al 1969). For the purposes of this
paper we will assume that the method of
estimating the hormone has been standardized. Since
high values usually have larger variability than
small values, a logarithmic transformation is
often applied before any analysis. This
transformation will be performed before all analyses
described in this paper.

The statistical problem is to identify when the
level of hormone is elevated as compared to when
it is at baseline. If the pattern is expected to
consist of cycles of baseline and plateau, each
cycle can be characterized by four states:
baseline, a rise, a plateau and a decline, followed
again by baseline, etc. In order to compare the
effects of different interventions, it is
desirable to have an objective method to identify
these four states and the times at which changes
between the states occur.

When this pattern is expected to repeat itself
at regular intervals, approaches to the
estimation of the frequency of pulses are spectral
analysis (Koopmans, 1974) or the fitting of ARIMA
models (Box and Jenkins, 1976). Two problems
with these approaches are that: (1) because of
limitations on the amount of blood drawn, often
only one cycle is observed in any subject, and
(2) many experiments are designed to disrupt the
rhythm so that the series will not be stationary.

Any sample taken during, or shortly after, the
release of a bolus (a concentrated pulse) will
show very high levels of hormone. These boluses
are released at randomly spaced times, more
frequently when production of the hormone is rapid
and less frequently when the production of the
hormone is at a nadir. However, if the blood
sample is taken near the time of a pulse (whether
at the peak or nadir of the cycle), a high level
of hormone will be found in the blood. At the
nadir of the cycle, it is less likely that the
values of several successive samples will all be
elevated.

Therefore, a statistical model for the hormone
would include the four phases (baseline, rise,
plateau, and decline). The error term would be
composed of two parts, one conventional (due to
random biological and technical variation) and a
second probabilistic, to reflect the possibility
of sampling at or near the time of a bolus even
when the overall level of hormone may be low.

In this paper we will first describe an
empirical algorithm to identify the four phases of the
cycle. The core of the algorithm is to identify
the four phases by using regression statistics
that are computed about each time point using
limited sets of contiguous data values. A
drawback to this algorithm is the difficulty in
specifying a statistical criterion or test for
the presence of a cycle.

Therefore, a second algorithm is also described
which clusters contiguous points using a least
squares algorithm. A conservative t-like
statistic is proposed as a criterion to terminate the
clustering process.

Although  more  work  is  needed  on  the  development 
of  these  algorithms,  the  use  of  an  algorithm 
(even  one  that  is  not  optimal)  provides  results 
that  can  be  compared  across  interventions  and  is 
preferable  to  the  subjective  evaluations  of 
cycles  and  phases  that  are  currently  used. 


ALGORITHM 1: USING REGRESSION STATISTICS


The model assumed for the data is that there
are four phases:

baseline   $y_t = c + \varepsilon_t$

rise       $y_t = a + bt + \varepsilon_t$,   $b > 0$

plateau    $y_t = c + \varepsilon_t$

decline    $y_t = a - bt + \varepsilon_t$,   $b > 0$

where a, b and c are constants that differ
between phases and between cycles and b is strictly
positive. Each model holds only for a single
phase which is represented by a restricted
interval in time (t); amplitudes of the baselines and
of the plateaus, and the slopes of the rises and
declines will differ at different cycles (time
intervals). That is, no regularity of the signal
is assumed.

The error $\varepsilon_t$ is composed of two parts:

$$\varepsilon_t = E_t + h(t - t_0)$$

where $E_t$ is an error term with mean zero and
constant variance and $h(t - t_0)$ is a function that
represents the height of the signal due to a
bolus released at time $t_0$ but assayed at time t.
The function $h(t - t_0)$ is positive, but may be zero
when the sample is taken sufficiently far (in
time) from the last bolus so that the net effect
of the bolus is negligible (part of the baseline
or plateau). Note that the effect of h on the
error term is asymmetric.

The first attempt at developing the algorithm
was to fit line segments to the data using a
fixed window (i.e., using the same number of
contiguous points each time) and centering the
window at each time point in the data sequence.
Not surprisingly, single high values (caused by
sampling near a bolus) produced estimates of
slopes that were positive when approaching the
point and negative after the point. That is,
values caused by a bolus were highly influential
in determining the estimates of the coefficients
and therefore in determining the phase. Hence,
the following approach was used.

Let t represent the center of a window (set of
contiguous points) and g the number of data
points in the window to each side of t. That is,
the data points in the window are

$$y_{t-g},\ y_{t-g+1},\ \ldots,\ y_{t-1},\ y_t,\ y_{t+1},\ \ldots,\ y_{t+g-1},\ y_{t+g}.$$

For each t and a set of g's ($g_1 \le g \le g_2$),
estimate the regression statistics:

the mean of y: $\bar{y}$

the variance of y: $s_{yy}$

the covariance of y and t: $s_{yt}$

the variance of t: $s_{tt}$

the slope of y on t: $b = s_{yt}/s_{tt}$

the correlation of y and t: $r = s_{yt}/(s_{yy} s_{tt})^{1/2}$

The above statistics were then corrected to
eliminate the most discrepant observation in the
window. (Initially, this criterion was used to
eliminate the maximal value only, but examination
of the signal indicated that at the time that
pulsing slowed, single isolated low values may
occur in the signal. Therefore, the criterion
was modified to be symmetric.) This adjustment
is applied to all windows in a uniform manner so
that statistics across windows are comparable.
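
The windowed computation can be sketched as follows. This is a minimal illustration rather than the authors' code: the text does not say how the "most discrepant" observation is chosen, so the sketch assumes it is the value farthest from the window median, and the function name `window_stats` is hypothetical.

```python
from statistics import mean, median

def window_stats(y, center, g):
    """Regression statistics of y on t over the window center-g..center+g,
    computed after dropping one observation.

    Assumption (not from the paper): the "most discrepant" observation is
    the one farthest from the window median; ties go to the earliest point.
    """
    idx = list(range(center - g, center + g + 1))
    vals = [float(y[i]) for i in idx]
    med = median(vals)
    drop = max(range(len(vals)), key=lambda k: abs(vals[k] - med))
    t = [i for k, i in enumerate(idx) if k != drop]
    w = [v for k, v in enumerate(vals) if k != drop]

    n = len(w)
    t_bar, y_bar = mean(t), mean(w)
    s_tt = sum((a - t_bar) ** 2 for a in t) / (n - 1)
    s_yy = sum((b - y_bar) ** 2 for b in w) / (n - 1)
    s_yt = sum((a - t_bar) * (b - y_bar) for a, b in zip(t, w)) / (n - 1)

    slope = s_yt / s_tt
    # The correlation is undefined for a constant window; report 0 there.
    r = s_yt / (s_yy * s_tt) ** 0.5 if s_yy > 0 else 0.0
    return {"mean": y_bar, "s_yy": s_yy, "slope": slope, "r": r}

# A single spike in an otherwise flat window no longer produces a slope.
flat_with_spike = [1, 1, 1, 1, 9, 1, 1, 1, 1]
print(window_stats(flat_with_spike, center=4, g=3))
```

Calling the function for each window half-width in use (for example g = 3, 4, 5 for windows of length 7, 9 and 11) gives the set of statistics from which the minima and maxima over windows are taken.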

There is no unique width of window that is
optimal. A single width is not desirable, since
it reduces the flexibility of the algorithm to
smooth across data values and fulfil a criterion
for identifying a baseline or plateau. The use of
a variable length window (i.e., windows of length
7, 9 and 11 were used simultaneously) allows
greater smoothing across data values and
therefore fewer very short cycles are identified.
(Very short cycles are not believed to be of
physiological interest.)

We have chosen to use empirical critical values
to determine the phases, where the values are
estimated by percentiles of the empirical
distribution. For each of the above statistics,
the median and the quartiles are estimated. Two
exceptions should be noted. Since the baseline
in some hormones may be below the threshold of
the assay and therefore $s_{yy}$ may be zero for many
evaluations, the quartiles for $s_{yy}$ are computed
from nonthreshold (minimum) values. For slope,
the upper and lower quartiles correspond to the
medians of the positive and negative slopes,
respectively.

The criteria for determination of possible
phase at a time point are:

baseline:   (min r < Q25
             OR min $s_{yy}$ < median)
            AND $\bar{y}$ < Q25

rise:       max $r_+$ > Q75
            OR max $\mathrm{slope}_+$ > Q75
            OR ($s_{yy}$ > Q75
                AND max $\mathrm{slope}_+$ > |max $\mathrm{slope}_-$|)

plateau:    (min r < Q25
             OR min $s_{yy}$ < median)
            AND $\bar{y}$ > Q25

decline:    max $r_-$ > Q75
            OR max $\mathrm{slope}_-$ < Q25
            OR ($s_{yy}$ > Q75
                AND max $\mathrm{slope}_+$ < |max $\mathrm{slope}_-$|)


where min and max are the minimum and maximum,
respectively, of the statistic over all the
windows for which the statistic is computed. Q25 is
the lower quartile and Q75 is the upper quartile
for all values of the statistic (as computed with
the exceptions described above). The + or - sign
used as a subscript indicates that the statistics
are computed only when the slope is
positive or negative, respectively; i.e., those
appropriate to identify a rise or a decline,
respectively. AND and OR are the logical operators:
both sides of an AND must be fulfilled for the
condition to be accepted, but only one (or both)
side(s) of an OR needs to be fulfilled.

Based on the above definitions of phases, many
time points can be assigned to more than one
phase. Therefore, the final phase for each time
point is chosen using the philosophy that the
method of allocation should favor a point being
set to a baseline or plateau when the set of
neighboring points have low variability (or low
correlation with time). Only when there is
sufficient variability within a contiguous set of
time points should the phase be set to a rise or
decline (depending on the sign of the
coefficient). This is implemented in the following
manner. (Lower case letters indicate that the
criteria for the phase are fulfilled and capital
letters indicate that the phase has been assigned
to the time point.)

Pass 1:

IF (baseline AND NOT plateau) THEN BASELINE

IF (plateau AND NOT baseline) THEN PLATEAU

IF (ONLY rise) THEN RISE

IF (ONLY decline) THEN DECLINE
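
The Pass 1 rules can be written as a small function. This is a sketch under the assumption that the candidate phases for a time point are held as a set of lower-case names; the function name `pass_one` and this representation are illustrative and not part of the original algorithm description.

```python
def pass_one(candidates):
    """Apply the Pass 1 rules to one time point.

    `candidates` is the set of lower-case phase names whose criteria are
    fulfilled. Returns the assigned phase in capitals, or None when the
    point stays unallocated and is left for the later passes.
    """
    if "baseline" in candidates and "plateau" not in candidates:
        return "BASELINE"
    if "plateau" in candidates and "baseline" not in candidates:
        return "PLATEAU"
    if candidates == {"rise"}:
        return "RISE"
    if candidates == {"decline"}:
        return "DECLINE"
    return None

print(pass_one({"baseline", "rise"}))     # flat phase is favored
print(pass_one({"rise"}))
print(pass_one({"baseline", "plateau"}))  # left for Pass 2
```

Because the flat-phase rules are tested first, a point that qualifies as both baseline and rise is set to BASELINE, which reflects the stated preference for flat phases.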

Pass 2:

In this and the next passes, contiguous points
that are as yet unallocated to phases but have
the same set of possible phases are treated as a
single time point in terms of determining the
preceding and following phases.

IF (baseline OR plateau) AND:

contiguous to point set to BASELINE    THEN BASELINE

contiguous to point set to PLATEAU     THEN PLATEAU

follows a point set to DECLINE         THEN BASELINE

follows a point set to RISE            THEN PLATEAU

precedes a point set to RISE           THEN BASELINE

precedes a point set to DECLINE        THEN PLATEAU

Pass 3:

All unallocated points are set to a phase that
is most consistent with the phases of the
neighboring time points.

All as yet unallocated time points that have a
permissible phase equal to that of a contiguous point
which has already been assigned its phase are set
to the phase of the neighboring point.
Otherwise, if a permissible phase is consistent
with a phase that should follow the phase of the
preceding point or with a phase that should precede
that of the following point (if valid), then that
phase is selected for this time point. Those
time points which do not fulfill the requirements
of any of the phases are set equal to a
neighboring phase.

Discussion of the Regression Approach

Several aspects of this algorithm need further
explanation:

When using a short window, a high correlation
may occur even when only one point differs from
all the other points; e.g., the correlation of
time to a sequence of k values that are constant
except for one endpoint which is unequal to the
other values is $(3/(k+1))^{1/2}$; i.e., when k is 9,
the correlation is 0.55. Therefore, high
correlations can occur in the presence of low
variation. By assigning all cases to the
baseline or plateau when there is low variation,
high correlations by themselves do not cause
these times to be classified into a rise or
decline.
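
The endpoint correlation quoted above can be verified numerically. The sketch below builds a sequence of k values that is constant except for its last point and compares the sample correlation with time against $(3/(k+1))^{1/2}$; the function name is illustrative.

```python
from math import sqrt

def endpoint_correlation(k):
    """Correlation of time with k values that are constant except the last."""
    t = list(range(1, k + 1))
    y = [0.0] * (k - 1) + [1.0]
    t_bar, y_bar = sum(t) / k, sum(y) / k
    s_ty = sum((a - t_bar) * (b - y_bar) for a, b in zip(t, y))
    s_tt = sum((a - t_bar) ** 2 for a in t)
    s_yy = sum((b - y_bar) ** 2 for b in y)
    return s_ty / sqrt(s_tt * s_yy)

for k in (7, 9, 11):
    print(k, round(endpoint_correlation(k), 3), round(sqrt(3 / (k + 1)), 3))
```

The value does not depend on the size of the deviating point, only on the window length, which is why a single aberrant endpoint can mimic a trend in a short window.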

We  have  already  described  the  problem  of  high 
values  near  the  time  of  release  of  a  bolus. 
Since  it  is  not  possible  to  predict  when  these 
will  occur,  the  algorithm  must  be  relatively 
insensitive  to  large  spikes.  The  blood  will 
usually  complete  several  full  circulations 
through  the  body  between  two  successive  samples; 
therefore,  the  high  level  of  hormone  released  by 
a  bolus  should  primarily  affect  the  value  of  a 
single observation. (Under very rapid sampling it
is possible that more than one consecutive sample
will be elevated by a single bolus.) The
elimination  of  the  maximal  value  in  each  window 
reduces  the  effect  of  a  possible  spike.  We  do  not 
suggest  testing  for  a  spike  prior  to  eliminating 
a  point  since  the  distribution  of  the  elevated 
value  is  a  continuum  and  the  observed  value 
depends  on  the  timing  of  the  sample  relative  to 
the  actual  release  of  the  bolus. 

Empirical values are used to determine the
assignment of time points to phases. The
evaluations of the statistics reuse the same data many
times: both for different windows centered at the
same time point and also for windows centered at
contiguous or nearby time points. Therefore,
the set of statistics generated in the first
phase of the algorithm is highly correlated.
For this reason it would be difficult to develop
exact tests of significance. For example, for
each window we also computed the F-statistic that
tests whether the slope (or correlation) is zero.
Since definite cycles exist in the data that we
have analyzed, the F-statistics corresponding to
the quartiles are highly significant.

The criteria for rise and decline use
quartiles. The criteria for baseline and plateau use
the median. This is again a definite bias in the
allocation scheme to set time points to a 'flat'
phase, rather than a 'changing' phase. Our
experience has shown that setting more restrictive
criteria for the baseline and plateau causes many
short cycles to be identified within a plateau or
a baseline. These new cycles are very short and
do not correspond to our understanding of the
underlying physiology of the rhythm under study.

The algorithm is designed to choose among the
possible phases and to select that phase that
enhances the cycling. Therefore, the clean
appearance of the result is not a proof of the
cycling. As indicated above, we assume that
other more standard methods have been used to
test for changes in the height of the signal,
which corresponds to testing for the presence of
at least two different levels of activity. This
algorithm is intended to identify the location of
the cycles.

ALGORITHM  2:  CLUSTERING  NEIGHBORS 

Using  the  same  model  as  above,  it  is  possible 
to  view  the  time  sequence  of  data  points  as 
divided  into  clusters  of  time  points  with  similar 
values:  a  cluster  with  a  low  mean  value  is  a 
baseline  and  one  with  a  high  mean  is  a  plateau. 
Between  these  two  clusters  may  be  one  or  more 
clusters with intermediate values corresponding
to a rise or decline.

The  algorithm  for  clustering  adjacent  points 
consists  of  the  following  steps: 

Step  1: 

Compute the minimum, maximum, range and
standard deviation (SD) of the sequence.
Identify  as  an  outlier  any  point  that  differs  by 
one-half  the  range  from  the  mean  of  its  four 
neighbors  (two  time  points  on  either  side)  and  is 
the  minimum  or  maximum  of  these  five  points. 
Outliers  are  then  eliminated  from  calculations  in 
the  remainder  of  the  algorithm. 
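
Step 1 can be sketched as below. Two points are assumptions rather than statements from the text: "differs by one-half the range" is read as "differs by at least one-half the range", and the first and last two points, which lack two neighbors on each side, are never flagged. The function name `find_outliers` is illustrative.

```python
def find_outliers(y):
    """Indices of points flagged as outliers under the Step 1 rule.

    Assumptions (not from the paper): the comparison is ">= range / 2",
    and points without two neighbors on each side are skipped.
    """
    rng = max(y) - min(y)
    if rng == 0:
        return []  # a constant series has no outliers
    flagged = []
    for i in range(2, len(y) - 2):
        neighbors = [y[i - 2], y[i - 1], y[i + 1], y[i + 2]]
        five = neighbors + [y[i]]
        far = abs(y[i] - sum(neighbors) / 4) >= rng / 2
        extreme = y[i] == min(five) or y[i] == max(five)
        if far and extreme:
            flagged.append(i)
    return flagged

print(find_outliers([1, 1, 1, 1, 5, 1, 1, 1, 1]))
```

In this example only the spike at index 4 is flagged; its neighbors are not, because their distance from the mean of their own four neighbors stays below half the range.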

Step  2: 

Set each time point to be a cluster. Combine
adjoining time points if they differ by less than
1% of the range. (This eliminates the need to
combine similar observations by repetitive steps
in the algorithm described below.)

Step  3: 

Compute the distances between adjoining
clusters and clusters separated by a single
intermediate cluster, where the measure of
distance is a t-like statistic defined as follows.
Let

$$t = d/\mathrm{SD}, \qquad d^2 = \frac{n_1 n_2 (\bar{x}_1 - \bar{x}_2)^2}{n_1 + n_2},$$

where $\bar{x}_1$, $\bar{x}_2$ are the means of the two clusters and
$n_1$, $n_2$ are their sample sizes. Then t is
distributed as a t-statistic when there are no clusters
in the series and will be bounded above by the
distribution of a t-statistic when there are
clusters (since the within cluster pooled
standard deviation must be less than or equal to the
SD from the entire series).

Let $t_2$ represent the measure of distance
between two adjacent clusters and $t_3$ that between
two clusters that are separated by a single
cluster. Min $t_2$ and min $t_3$ will represent the
minimum values of these statistics across all the
clusters.

Step 4:

Combine the two adjoining clusters with min $t_2$
unless: (1) two clusters with similar mean values
are separated by a single point ($t_3$ < min $t_2$), or
(2) the cluster formed by combining the two
clusters with min $t_2$ could be redivided into two
clusters with a smaller within cluster sum of
squares, or (3) the two clusters are part of a
sequence of three consecutive clusters such that
the distance between the two outermost clusters
($t_3$) is larger than a criterion (discussed below)
and the mean value of the intermediate cluster is
intermediate to the means of the two adjoining
clusters. The rationale for these three
exceptions is:

1) When two clusters have similar means but are
separated by a single point that has not been
amalgamated into either cluster, the single point
may represent a noisy signal that should be
eliminated from the clustering process. An
exception to this argument is when the clusters are
being initially formed, at which point random
variation is likely to create this type of
pattern. Therefore, criterion (1) is applied only
if the combined cluster would have at least ten
observations and there are at least two
observations in each of the original clusters.

2) Since the clustering algorithm is a stepwise
procedure, it need not be optimal in its choice
of clusters. Therefore, each time two clusters
are merged, it is desirable to check that the
cluster was formed from the optimal split into
two clusters (as if the clustering algorithm was
running in the reverse direction). When the
combined cluster can be divided into two adjacent
clusters at a different cutpoint for which the
within cluster sum of squares is less than the
within cluster sum of squares of the original two
clusters, the two clusters are realigned by
choosing a new boundary that corresponds to the
minimum within cluster sum of squares.

3) When two nonadjoining clusters (separated by
a single cluster) differ greatly in their mean
values ($t_3$ > 1.414 times the criterion described below),
the intermediate cluster may represent a rise or
decline. When the mean of the intermediate
cluster is approximately midway (40-60%) between
the means of the two neighboring clusters, then
the cluster is not included when computing the
min $t_2$ criterion for stopping the clustering
algorithm. However, if the stopping criterion is
not fulfilled, then the cluster will be combined
with its neighbor if $t_2$ for the cluster is less
than min $t_2$; i.e., these two clusters are
identified as the two to be combined.


Step 5:

If  the  criterion  for  stopping  (discussed  below) 
is  fulfilled,  then  print  out  the  current 
clusters.  Otherwise,  return  to  step  3. 
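
The core of Steps 3 to 5 can be sketched as a merge loop. This is a simplified illustration, not the full procedure: it uses only the distance between adjacent clusters, merges the closest pair, and stops when the smallest distance exceeds the critical value. The three exceptions of Step 4 and the outlier and 1% pre-merging steps are omitted, the series is assumed not to be constant, and the names `t_like` and `cluster_adjacent` are hypothetical.

```python
from statistics import mean, stdev

def t_like(a, b, sd):
    """t-like distance d/SD between two clusters, with
    d^2 = n1*n2*(mean1 - mean2)^2 / (n1 + n2)."""
    n1, n2 = len(a), len(b)
    d = (n1 * n2 / (n1 + n2)) ** 0.5 * abs(mean(a) - mean(b))
    return d / sd

def cluster_adjacent(y, critical=2.576):
    """Merge adjacent clusters until every adjacent distance exceeds
    `critical`. SD is the standard deviation of the whole sequence."""
    sd = stdev(y)
    clusters = [[v] for v in y]          # start with one cluster per point
    while len(clusters) > 1:
        dists = [t_like(clusters[i], clusters[i + 1], sd)
                 for i in range(len(clusters) - 1)]
        i = min(range(len(dists)), key=dists.__getitem__)
        if dists[i] > critical:          # stopping criterion fulfilled
            break
        clusters[i:i + 2] = [clusters[i] + clusters[i + 1]]
    return clusters

series = [0.0] * 10 + [10.0] * 10
print([len(c) for c in cluster_adjacent(series)])
```

For this two-level series the loop ends with two clusters of ten points each; raising the critical value far enough forces the two levels to merge into a single cluster.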

Stopping  criterion: 

When the usual two-sample t-test is used to
test for the equality of levels between clusters,
the method of identifying clusters will
over-identify the number of clusters because the
clusters are chosen to maximize the t-statistic
(by combining at each step clusters that minimize
the t-like statistic). Therefore, a conservative
criterion for cluster identification is desired.

The statistic $t_2 = d/\mathrm{SD}$, where SD is the
standard deviation of the original sequence, is
less than a t-statistic based on the within
cluster variance. Therefore, tests based on this
statistic will reject the null hypothesis less often
than tests based on the within cluster variance.
An approximate relationship between the two
statistics (based on only two clusters in the
entire sequence) is:

$$t_2 \approx \frac{t}{\left[\,1 + (t^2 - 1)/(n_1 + n_2 - 1)\,\right]^{1/2}}$$

where $t_2$ is the statistic proposed here and t is
the pooled two-sample t-statistic. Therefore, $t_2$
is less than t whenever t is greater than 1. For
small sample sizes the disparity may be large;
for large sample sizes the disparity will be
large only when t is also large (but then $t_2$ will
also be significant).
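
The relationship between the two statistics can be checked on a small example. The sketch assumes the whole sequence consists of exactly two clusters, the case named in the text; in that case the total sum of squares is the within cluster sum of squares plus $d^2$, so the relation should hold up to rounding. The function name `compare` is illustrative.

```python
from statistics import mean, stdev

def compare(a, b):
    """Return (t, t2, predicted t2) for a sequence made of clusters a and b."""
    n1, n2 = len(a), len(b)
    d = (n1 * n2 / (n1 + n2)) ** 0.5 * abs(mean(a) - mean(b))
    within = (sum((v - mean(a)) ** 2 for v in a)
              + sum((v - mean(b)) ** 2 for v in b))
    s_pooled = (within / (n1 + n2 - 2)) ** 0.5
    t = d / s_pooled                      # pooled two-sample t-statistic
    t2 = d / stdev(a + b)                 # statistic based on the overall SD
    predicted = t / (1 + (t * t - 1) / (n1 + n2 - 1)) ** 0.5
    return t, t2, predicted

t, t2, predicted = compare([1.0, 2.0, 3.0], [4.0, 6.0, 8.0])
print(round(t, 3), round(t2, 3), round(predicted, 3))
```

Here t exceeds 1, so $t_2$ is smaller than t, which illustrates the conservative behavior of the proposed criterion.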

In our analyses critical values corresponding
to 0.05, 0.01 and 0.001 were used. The
intermediate value was sufficient to eliminate many
small clusters.

The criterion for $d_3$ was set at 1.414 times
that for $d_2$ in order to adjust for the number of
comparisons (searching across three adjacent
clusters instead of two).

To stop the clustering algorithm, either each
$t_2$ must be greater than the critical value or the
cluster must be contained within a set of three
clusters for which $t_3$ is greater than 1.414 times
the critical value. The first and last cluster
of the entire sequence do not have to fulfil
these criteria in order for stepping to terminate,
since the sequence may have begun or ended during
a rise or decline.

EXAMPLES 

In Figures 1 and 2 we present the logarithms of
the values of luteinizing hormone (LH) in
ovariectomized ewes that were sampled twice per week
for more than four years. The first ewe (Figure
1) was kept outdoors and therefore subject to an
annual photoperiodic stimulus and the second ewe
(Figure 2) was kept in a controlled light
environment (eight hours light per day) to study the
effect of the disruption of the photoperiodic
stimulus.


In each figure there are two sets of lines.
The uppermost set of lines are the clusters (as
identified by the clustering algorithm) plotted
at the mean level of hormone for the cluster.
The raw data values (on a log scale) are
scattered about the cluster lines. In several
sections there appears to be more than one cluster
(see arrows); the more detailed cluster structure
(smaller cluster groups) was obtained using a
critical value for $t_2$ of 1.96 and the longer
lines were obtained using a critical value of
2.576.

The lower schematic in each figure presents the
results of the regression-like algorithm. Values
are graphed on four levels; the lowest is the
baseline and the highest is the plateau. A cycle
would contain both a baseline and a plateau and a
return to a baseline. As may be noted, there are
many cases of a plateau followed by a rise or
decline followed by another plateau (and
similarly for baselines). This suggests a change in
level of the plateau, but not a full cycle.

Although there is relatively good agreement
between the (extended) baselines and plateaus and
the clusters, the patterns identified by the
regression algorithm appear to be noisier than
those identified by the clustering algorithm.
However, the clustering algorithm may not
identify rises and declines. Also, since the
clustering algorithm assumes homogeneous variance on the
transformed scale for the entire series, it is
less likely to identify cycles whose nadir to
peak amplitude is relatively small compared to
the overall SD.

The  experiment  was  designed  to  study  whether 
the  annual  rhythm  is  disrupted  by  removing  the 
photoperiodic  stimulus.  Note  the  regularity  of 
the  clusters  for  the  control  animal  (Figure  1) 
but  the  change  in  pattern  of  the  cycles  over 
years  under  constant  light  conditions  (Figure  2). 


Figure 1. Luteinizing hormone (LH) levels of an ovariectomized ewe (#1006) maintained
outdoors.  The  data  points  represent  levels  of  LH  in  blood  samples  taken  twice  per 
week  starting  on  May  24,  1983  and  ending  on  22.  1988. 

a)  Data  points  (on  log  scale). 

b)  There  is  a  lower  threshhold  to  the  sensitivity  of  the  radioimmunoassay  for  IH. 

Therefore,  samples  at  the  threshhold  appear  to  f'-t —  straight  lines  -imilar  to 

clusters . 

c)  Clusters  are  represented  by  straight  lines  at  the  average  level  of  LH  in  the  cluster. 

Critical  values  of  1.96  or  of  2.576  were  used.  Arrows  Indicate  where  the  solutions 

differ.  The  shorter  lines  are  clusters  formed  by  using  1.96.  The  average  level  of 

the  two  smaller  clusters  when  combined  together  is  equal  to  the  average  level  of  the 
combined  cluster. 

d)  Phases  due  to  the  regression  algorithm.  The  four  levels  (starting  from  the  top) 

correspond  to  the  phases:  plateau,  decline,  rise  and  baseline. 
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Figure  2.  Luteininzing  hormone  (LH)  levels  o^C  an  ovariectomized  ewe  (#2021)  maintained  in 
a  controlled  light  environment  (eight  hours  of  light  per  day).  The  data  points 
represent  levels  of  LH  in  blood  samples  taken  twice  per  week  starting  on  March  18, 
1983  and  ending  on  Jan  22,  1988.  See  the  legend  of  Figure  1  for  explanation  of 

symbols . 


DISCUSSION: 

We  have  presented  two  algorithms  to  identify 
phases  of  a  cycle. 

The first algorithm identifies all four phases of a cycle; however, it is difficult to formalize statistical tests with respect to individual cycles.

The  second  algorithm  identifies  the  baseline 
and  plateau.  The  rises  and  declines  are  at  times 
included  within  the  baseline  or  plateau.  An  ad 
hoc  test  for  the  presence  of  a  cluster  is  also 
presented. 

It  would  appear  that  a  combination  of  the  two 
algorithms  may  be  preferable  to  either  one 
individually.  For  example,  the  clustering 
algorithm  can  be  extended  so  that  a  straight  line 
is  fit  to  the  data  between  the  baseline  and 
plateau  (or  between  the  plateau  and  baseline)  so 
as  to  best  fit  (minimize  the  sum  of  squares  of) 
the intervening points. To do so, the algorithm might be allowed to include (one or more points adjacent to) the endpoints of the baseline and plateau in the regression calculation. This may improve the estimation of the intermediate phases.

The  major  advantage  of  an  algorithm  is  that  it 
will  allow  comparison  among  experimental 
conditions  in  a  consistent  manner.  Currently, 
investigators  eyeball  their  data  to  identify 
different phases. Questionable data points are assigned in a very subjective manner. This algorithm (with additional tuning) should duplicate the investigator's clearcut assignments and then assign the questionable points in a consistent manner.

OPTIMIZATION IN THE DESIGN OF SEQUENTIAL CLINICAL TRIALS

Richard  Simon,  National  Cancer  Institute 


1.  Introduction 

I use the word 'optimization' in the title with some hesitation.
'Optimum' clinical trials are those that address an important
medical  question,  obtain  a  reliable  and  timely  answer  and  are 
reported  responsibly  in  the  medical  literature.  I  will  use  the  term 
'optimization'  here  in  a  more  limited  and  technical  sense  to  refer 
to  efficiency  in  the  conduct  of  a  clinical  trial.  I  will  describe  several 
applications  of  optimization  to  the  design  of  simple  sequential 
clinical  trials. 

2. Phase II Clinical Trials

A phase II study of a cancer treatment is an uncontrolled trial for obtaining an initial estimate of the degree of anti-tumor effect of the treatment. The proportion of patients whose tumors shrink by at least 50% is called the response rate and is the statistic of primary interest. The purpose of a phase II trial is to determine whether the drug has sufficient activity against a specific type of tumor to warrant its further development.

The designs described here are based on testing a null hypothesis $H_0: p \le p_0$ that the true response probability is less than some uninteresting level $p_0$. If the null hypothesis is true then we require that the probability of accepting the drug for further study in other clinical trials should be less than $\alpha$. We also require that if a specified alternative hypothesis $H_1: p \ge p_1$, that the true response probability is at least some desirable target level $p_1$, is true then the probability of rejecting the drug for further study should be less than $\beta$.

It is rarely practical to utilize a sequential design that requires re-analysis of the data after treatment of each patient. Response assessment may take weeks or months and so the most popular approach to sequential analysis is the 'group-sequential' approach in which interim analyses are performed after groups of patients are treated and evaluated. Let $n_i$ denote the number of patients treated in the $i$'th stage of a phase II trial and let $S_i$ denote the total number of responses observed through the end of the $i$'th stage. A decision rule or sequential decision boundary may be specified by a set of pairs $(l_i, u_i)$ where we reject the treatment after the $i$'th stage if $S_i \le l_i$ and we stop the trial and accept the treatment if $S_i \ge u_i$. Otherwise, the trial proceeds to the $(i+1)$'st stage. If we specify the maximum number of stages $I$ and the error limits $\alpha$ and $\beta$ for the hypotheses based on $p_0$ and $p_1$ then one may consider the optimization problem of finding the sample sizes $(n_i)$ and the sequential boundaries $(l_i, u_i)$ for $i = 1, \ldots, I$ which satisfy the error probability constraints and minimize

$$\int E[N \mid p]\, w(p)\, df(p).$$

$N$ denotes the number of patients treated in the trial before termination. $N$ is a random variable with maximum value $n_1 + \cdots + n_I$. $E[N \mid p]$ denotes the expected value of $N$ when the true response probability is $p$. The function to be minimized is the expected sample size averaged with regard to a prior distribution $f$ for the unknown response probability $p$. The average is also weighted by a function $w$ which specifies the relative importance of the sample size for different values of the response probability.

I (Simon 1987) have published designs with 2 stages based on minimizing the expected sample size when $p = p_0$. The designs were limited to two stages because that is often a practical constraint. It may be necessary to stop entering new patients onto a phase II trial at the end of a stage until one determines whether the conditions for continuation have been achieved. Because of the delay in evaluating response, this may require suspension of accrual for weeks or months. Such suspension is an inconvenience to physicians who are deciding how to treat patients and may not be tolerated more than once in the course of a study. I used $w(p_0) = 1$ and $w(p) = 0$ for all other values of $p$ as a simple way of representing the importance of minimizing the number of patients given an ineffective drug. With this formulation it was not necessary to specify the prior distribution $f$. Obviously, more general specifications are possible. For many phase II clinical trials, there is no strong desire to terminate early if the treatment appears effective because there are secondary endpoints of interest. If the treatment is inactive, however, the trial should terminate as early as possible. Consequently, I set $u_1 = n_1 + n_2$, so that early acceptance cannot occur, and optimized with regard to the parameters $n_1$, $n_2$, $l_1$ and $l_2$.

For specified values of $p_0$, $p_1$, $\alpha$ and $\beta$, optimal designs were determined by enumeration using exact binomial probabilities. For each value of total sample size $n = n_1 + n_2$ and each value of $n_1$ in the range $(1, n-1)$, the integer values of $l_1$ and $l_2$ were determined which satisfied the two constraints and minimized the expected sample size when $p = p_0$. This was found by searching downward over the range $l_1 \in (0, l_1^*)$; $l_1^*$ is the largest integer for which $B(l_1; p_1, n_1) < \beta$, where $B$ denotes the cumulative binomial distribution function. For each value of $l_1$ we determined whether there was a value of $l_2$ such that the design $(n, n_1, l_1, l_2)$ satisfied both type 1 and type 2 error constraints. If not, then we continued our downward search on $l_1$. If the design satisfied the constraints, then it was optimal for those values of $n$ and $n_1$. Keeping $n$ fixed, we searched over the range of $n_1$ to find the optimal two-stage design for that maximum sample size $n$. The
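search over $n$ ranged from a lower value of about

$$\bar p (1 - \bar p) \left[ \frac{z_{1-\alpha} + z_{1-\beta}}{p_1 - p_0} \right]^2$$

where $\bar p = (p_0 + p_1)/2$ and the $z$ values are percentiles of the standard normal distribution. We checked below this starting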
point to ensure that we had determined the smallest sample size $n$ for which there was a nontrivial ($n_1, n_2 > 0$) two-stage design which satisfied the error probability constraints. The enumeration procedure searched upwards from this minimum value of $n$ until it was clear that the optimum had been determined. The minimum expected sample size for fixed $n$ is not a unimodal function of $n$ because of discreteness of the underlying binomial distribution. Nevertheless, eventually the local minima increased and a global minimum was identified. Calculations were carried out in APL on a Microvax II computer. Table 1 shows some optimal designs for the case $\alpha = \beta = 0.10$. In Table 1, $N_{max} = n_1 + n_2$, $E_0(N)$ is the expected sample size when the null hypothesis is true, and $PET_0$ is the 'probability of early termination' (after the first stage) when the null hypothesis is true. Designs for large values of $p_0$ are not usually appropriate for the testing of new drugs but can be useful for pilot studies of combinations of drugs known to be active.
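The quantities that the enumeration compares can be computed directly from exact binomial probabilities. The following Python sketch is only an illustration, not the APL program used for Table 1: the function names are hypothetical, the convention that the drug is rejected when the cumulative number of responses is at most $l_1$ (stage 1) or at most $l_2$ (end of trial) is an assumption consistent with the search described above, and the example design values are chosen for illustration rather than taken from Table 1.

```python
from math import comb


def binom_pmf(x, n, p):
    """Exact binomial probability of x responses among n patients."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)


def binom_cdf(x, n, p):
    """Cumulative binomial distribution function B(x; p, n)."""
    if x < 0:
        return 0.0
    return sum(binom_pmf(k, n, p) for k in range(0, min(x, n) + 1))


def two_stage_oc(n1, n2, l1, l2, p):
    """Operating characteristics of a two-stage design at response rate p.

    Assumed convention: reject the drug after stage 1 if S1 <= l1;
    otherwise treat n2 more patients and reject if the total S2 <= l2.
    There is no early acceptance.
    """
    pet = binom_cdf(l1, n1, p)            # probability of early termination
    expected_n = n1 + (1.0 - pet) * n2    # E[N | p]
    accept = 0.0                          # probability of accepting the drug
    for x1 in range(l1 + 1, n1 + 1):
        need = l2 + 1 - x1                # responses still needed in stage 2
        p_stage2 = 1.0 if need <= 0 else 1.0 - binom_cdf(need - 1, n2, p)
        accept += binom_pmf(x1, n1, p) * p_stage2
    return {"pet": pet, "expected_n": expected_n, "accept": accept}


if __name__ == "__main__":
    # Illustrative design: n1 = 9, n2 = 15, l1 = 0, l2 = 2.
    for p in (0.05, 0.25):
        print(p, two_stage_oc(9, 15, 0, 2, p))
```

Evaluating `accept` at $p_0$ gives the type 1 error and one minus `accept` at $p_1$ gives the type 2 error, so a search over $(n, n_1, l_1, l_2)$ only needs to call such a function for each candidate and keep the feasible design with the smallest `expected_n` at $p_0$.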

3. Phase II/III Clinical Trials

In cancer therapeutics it is conventional to obtain 'promising' data for a new treatment in an uncontrolled phase II study before initiating a randomized comparison to an established regimen. An
alternative  approach,  however,  would  employ  a  randomized  design 
from  the  outset  with  early  termination  if  preliminary  results  are 
not  promising  for  the  experimental  treatment.  Such  an  approach 
avoids  the  difficulties  of  interpreting  results  arising  from  an 
uncontrolled  study. 

Thall, Simon, Ellenberg and Shrager (1988a) developed two-stage designs of this type for clinical trials where the endpoint is binary, response or no response. These designs test a null hypothesis $H_0: \Delta = 0$ against a one-sided alternative $H_1: \Delta > 0$, where $\Delta = p_e - p_c$ and $p_e$, $p_c$ denote the response probabilities for the experimental and control treatments respectively. At the first stage $n_1$ patients are randomly assigned to receive each of the two treatments. Let $r_{i1}$ denote the observed number of responses with treatment $i$ ($i = e, c$) in the first stage and let $\hat\Delta_1 = (r_{e1} - r_{c1})/n_1$, $\hat p_1 = (r_{e1} + r_{c1})/(2 n_1)$ and $\hat q_1 = 1 - \hat p_1$. If

$$\frac{\hat\Delta_1}{(2 \hat p_1 \hat q_1 / n_1)^{1/2}} > y_1$$

then continue to the second stage; otherwise terminate the trial and accept $H_0$.

In the second stage $n_2$ patients are randomly assigned to receive each of the two treatments. Let $\hat\Delta$, $\hat p$ and $\hat q$ be defined similarly to the related quantities above but based on all data from the two stages. At the end of the second stage, if

$$\frac{\hat\Delta}{(2 \hat p \hat q / (n_1 + n_2))^{1/2}} > y_2$$

then reject $H_0$; otherwise accept $H_0$. The constants $n_1$, $n_2$, $y_1$, $y_2$ are chosen to minimize an average expected sample size subject to error probability constraints. The probability of rejecting $H_0$ should be no greater than a specified $\alpha$ whenever the null hypothesis is true, regardless of what the common value of the response probability is. The probability of rejecting the null hypothesis should be at least $1 - \beta$ for a specified alternative $\theta = (p_e, p_c)$. We minimized the expected sample size averaged equally over $H_0$ and over $\theta$. Minimization of $E_\theta(N)$ alone produces designs with low probability of stopping at the end of the first stage and hence relatively poor performance when the null hypothesis is true. Minimization of $E_0(N)$ alone produces designs with high probability of early termination under $H_0$ but large maximum sample sizes, and consequently poor performance under the alternative. Although we could have minimized relative to a prior and weight function on the space of $(p_e, p_c)$, the simple approach that we used resulted in non-extreme designs with generally good performance under both the null and alternative hypotheses. Performance under the alternative hypothesis could be improved by permitting rejection of the null hypothesis after the first stage. Thall et al. (1988a) show how this can be accomplished by superimposing an early rejection rule of the O'Brien-Fleming (1979) type.
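The two-stage decision rule described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the MLAB program used for the optimization: the function names are hypothetical, the critical values `y1` and `y2` must be supplied by the user (the values in the example are made up), and returning a Z value of zero when the pooled response rate is 0 or 1 is an assumption added here to avoid division by zero.

```python
from math import sqrt


def z_statistic(r_e, r_c, n):
    """Pooled Z for r_e and r_c responses among n patients per arm."""
    delta = (r_e - r_c) / n            # estimated difference in response rates
    p = (r_e + r_c) / (2 * n)          # pooled response rate
    q = 1.0 - p
    if p == 0.0 or q == 0.0:
        return 0.0                     # assumption: no variation, no evidence
    return delta / sqrt(2.0 * p * q / n)


def two_stage_trial(stage1, stage2, n1, n2, y1, y2):
    """Apply the two-stage rule.

    stage1 and stage2 are (experimental, control) response counts observed
    in each stage; n1 and n2 are the per-arm sample sizes of the stages.
    """
    r_e1, r_c1 = stage1
    if z_statistic(r_e1, r_c1, n1) <= y1:
        return "accept H0 after stage 1"
    r_e2, r_c2 = stage2
    # The final test pools all data from both stages.
    z = z_statistic(r_e1 + r_e2, r_c1 + r_c2, n1 + n2)
    return "reject H0" if z > y2 else "accept H0 after stage 2"


if __name__ == "__main__":
    # Hypothetical critical values, for illustration only.
    print(two_stage_trial((10, 4), (10, 4), n1=20, n2=20, y1=0.5, y2=1.7))
```

The second-stage counts are ignored when the trial stops early, which mirrors the fact that those patients would never be accrued.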

The optimization problem was solved on a DEC-10 computer using MLAB, an interactive mathematical modeling program with built-in integration and curve fitting capabilities (Knott 1979). For each selected pair $(n_1, n_2)$, we determined values of $y_1$ and $y_2$ by solving nonlinear equations representing the error constraints. Zeros of the two equations were determined by solving a non-linear regression problem. We used normal approximations to the binomial distribution and performed numerical integration using a variable step Adams-Moulton predictor-corrector method. We determined the optimum design by a systematic search of an integer grid of $(n_1, n_2)$ values.

Table 2 shows some of the resulting optimal designs for $\alpha = 0.05$ and $\beta = 0.20$. One column gives the size of a single stage design with the same error probabilities for the null and specified alternative hypotheses. The large reduction in expected sample size under the null hypothesis is obtained with very little increase in maximum sample size compared to the fixed sample design. The column labeled $PET_0$ represents the probability of early termination after the first stage when the null hypothesis is true.

4. Two-Stage Selection and Testing Designs

In  clinical  research  there  are  often  several  experimental 
treatments  of  interest  but  too  few  patients  available  to  thoroughly 
evaluate  each  relative  to  a  control  therapy.  A  common  approach 
in  such  circumstances  is  to  first  select  the  experimental  treatment 
which  appears  most  promising  based  on  uncontrolled  pilot  studies 
and  then  compare  the  selected  treatment  to  the  control  in  a  large 
randomized  clinical  trial.  When  such  pilot  studies  are  performed 
at  different  institutions,  treatment  effects  typically  are  confounded 
with  other  factors  and  the  selection  of  a  most  promising  regimen  is 
problematic.  Thall,  Simon  and  Ellenberg  (1988b)  proposed  a  new 
approach  to  the  problem  of  identifying  the  best  of  several 
experimental  treatments  and  determining  whether  it  is  superior  to 
a  control.  We  developed  a  two-stage  design  for  use  with  binary 
endpoints. 

During the first stage $n_1$ patients are randomly assigned to each of the $K$ experimental treatment groups and $n_1$ patients are randomly assigned to the control group. At the end of the first stage the largest observed response rate for the experimental treatments is compared to the observed response rate for the control group. If the standardized normal Z value for that comparison does not exceed a critical value $y_1$, then the clinical trial is terminated and no experimental treatment is claimed to be better than the control. Otherwise, a second stage is conducted in which an additional $n_2$ patients are assigned to the control treatment and to the experimental treatment with the greatest response rate in the first stage. Thus, at most one experimental treatment is carried over to the second stage. The treatment carried into the second stage is called the "selected" treatment. At the end of the second stage the selected experimental treatment is compared to the control based on all data for those two treatments obtained in either stage. If the standardized normal Z value for that comparison exceeds a critical value $y_2$, then the global null hypothesis of equivalence of all the treatments is rejected and the selected experimental treatment is "chosen" as more effective than the control. Otherwise, the global null hypothesis of the equivalence of all $K+1$ treatments is not rejected.

As for the phase II/III design described above, our intent was to determine the parameters $n_1$, $n_2$, $y_1$, $y_2$ to minimize an average expected sample size subject to constraints on the type 1 and type 2 errors. The type 1 error constraint is straightforward: the probability of rejecting the global null hypothesis when it is true should not exceed a specified $\alpha$, whatever the value of the common response probability. The nature of the type 2 error, or power, constraint was much more complicated, however, because of the great variety of alternative hypotheses possible with $K+1$ treatments. We specified a generalized power constraint in the following way. An experimental treatment whose response probability exceeded that of the control by at least a pre-specified quantity $\delta_2$ is called "effective". An experimental treatment whose response probability exceeded that of the control by more than $\delta_1$ but by less than $\delta_2$ is called "marginally effective". We require that if there is at least one effective treatment and no marginally effective treatments then the probability of choosing an effective treatment as better than the control must be at least $1 - \beta$. If there are marginally effective treatments that are almost as good as the effective treatments, then there will be a substantial probability that one of them will be selected and chosen instead of an effective treatment, but the difference is of little consequence. If there are marginally effective treatments that are much worse than the effective treatments, they will have little influence on the probability of choosing an effective treatment. The least favorable configuration for this constraint is that with one experimental treatment having response probability exactly $\delta_2$ greater than the control and the remaining $K-1$ experimental treatments having response probabilities exactly $\delta_1$ greater than the control.

We  determined  the  values  of  the  design  parameters  to  minimize 
the  average  expected  sample  size,  weighted  equally  between  the 
null  hypothesis  and  the  least  favorable  alternative  configuration, 
subject  to  the  type  1  error  constraint  and  the  generalized  power 
constraint. The optimization algorithm was based on a grid search over $n_1$ and $y_1$. An integer grid was used for the former and a grid width of 0.025 for the latter. For specified values of $n_1$ and $y_1$, the nonlinear equations for type 1 error and generalized power were solved for the parameters $y_2$ and $\pi = n_1/(n_1 + n_2)$. Those
equations  were  solved  to  an  accuracy  of  ±10”^  using  the  least 


squares algorithm of Shrager (1970). Regarded as a function of $y_1$ for fixed $n_1$, the average expected sample size had two distinct local minima in all cases. A finer grid search in the neighborhoods of these local minima was carried out to obtain the minimum given $n_1$. As a function of $n_1$, this minimum is unimodal, thus yielding the global optimum.

Table 3 shows some of the optimum designs determined for $\alpha = 0.05$, $\beta = 0.25$. The probabilities of early termination under the null hypothesis are generally in excess of 0.50 and the maximum sample sizes are less than single stage trials with similar design objectives, such as those of Dunnett (1984).

5.  Conclusion 

I  have  presented  some  of  the  research  that  I  and  my  colleagues 
have  conducted  in  the  past  few  years  in  the  area  of  optimized 
sequential  designs  for  clinical  trials.  I  have  not  attempted  to 
present  a  review  of  related  work  by  others  although  this  topic  is 
one  of  increasing  interest  on  the  part  of  biostatisticians.  Although 
there  is  a  great  literature  on  sequential  designs  for  clinical  trials,  it 
is  only  recently  that  these  methods  have  seen  broad  application. 
Clinical  trials  are  complex  endeavors  and  the  simplest  designs  are 
often  the  most  practical.  For  this  reason  we  have  focused  on  two- 
stage  designs.  Even  with  such  simple  designs  it  is  possible  to 
achieve  substantial  reductions  in  required  sample  size  compared  to 
single  stage  designs.  Such  reductions  translate  into  reduced 
exposure  of  patients  to  ineffective  treatments  and  increased 
efficiency in the process of discovering effective ones.

BAYES ESTIMATION OF CEREBRAL METABOLIC RATE
OF  GLUCOSE  IN  STROKE  PATIENTS 


P.  David  Wilson,  University  of  South  Florida  College  of  Public  Health 
Sung-Cheng Huang and Randall A. Hawkins, UCLA School of Medicine

Local cerebral metabolic rate of glucose (LCMRG) is defined as a nonlinear function of the rate constants in a three-compartment model. Data for estimating LCMRG in the human brain are obtained by PET scanner following injection of F-18 labeled fluorodeoxyglucose. Optimal analysis would be based on scans repeated up to three hours, but this is not practical. Nuclear medicine scientists have therefore developed three single scan (SS) methods requiring only a single ten minute scan taken at about one hour post-injection. These SS methods use prior information in the form of mean rate constants from the normal (healthy) population, but are not Bayes methods. We have developed a Bayes method which can be used with a single scan. For brains of stroke patients (which contain mostly normal tissue and some ischemic tissue), the Bayes method uses a highest posterior density criterion to choose between prior densities from normal and ischemic tissue populations. Computer simulation studies show that the Bayes SS method is superior to the non-Bayes SS methods.

KEY WORDS: Bayes estimation, compartmental models, glucose metabolic rate.

1. INTRODUCTION

The current method for measurement of local cerebral metabolic rate of glucose (LCMRG) utilizes positron emission tomography (PET) images of the concentration of F-18 (a positron-emitting isotope of fluorine) in a local region of brain tissue, obtained after intravenous injection of F-18 labeled fluorodeoxyglucose (FDG), and while LCMRG is in steady-state.

Analysis of PET data is based on the three-compartment model for FDG kinetics shown in Figure 1. FDG is injected into the plasma compartment, from which it communicates with the brain tissue compartments. Once in the tissue, FDG can undergo phosphorylation, the first step in glucose metabolism. From the plasma compartment, FDG is also lost to urine and other tissues (not shown in Figure 1).

Figure 1. Compartmental Model for FDG Kinetics. $k_1, \ldots, k_4$ are the FDG rate constants. $C(t) = C_1(t) + C_2(t)$.

Unlike glucose, FDG does not proceed further down the metabolic path. This allows sufficient accumulation of FDG in brain tissue to provide relatively precise positron emission count statistics. It also prevents recirculation of any metabolic products containing F-18, which would contaminate the plasma compartment. LCMRG is defined as the net rate of phosphorylation of glucose. However, the rate constants for glucose are not the same as those for FDG, and it has been shown that

$$\mathrm{LCMRG} = (P_0/LC)\, k_1 k_3 / (k_2 + k_3) \qquad (1)$$

where $P_0$ is the capillary plasma concentration of glucose (required to be in steady state) and LC is a "lumped constant". For our purposes LC may be taken to be a known constant which accounts for the use of FDG rate constants instead of glucose rate constants.

2.  MATHEMATICAL  MODEL 

With  respect  to  the  compartmental 
model,  measurements  of  P(t)  are  obtained 
by  repeated  sampling  of  a  peripheral 
vessel. The PET scanner provides a noisy version of $C(t)$, the total F-18 concentration in the brain, defined as $C(t) = C_1(t) + C_2(t)$ in the compartmental model.
As  an  approximation,  the  contribution  to 
the  PET  data  from  the  FDG  in  the  brain 
capillaries  is  usually  ignored.  (However 
a  more  general  formulation  including  this 
contribution  can  be  found  in  Hawkins, 
Phelps,  and  Huang,  1986.) 

From  a  linear  systems  viewpoint,  P(t) 
can  be  viewed  as  the  input  function  to  a 
linear  system  with  output  function  C(t) 
and  impulse  response  h(t;k),  described 
below,  where  k  is  the  set  of  rate 
constants.  The  differential  equations 
implied  by  the  compartmental  model  are 

$$dC_1(t)/dt = k_1 P(t) + k_4 C_2(t) - (k_2 + k_3) C_1(t)$$

$$dC_2(t)/dt = k_3 C_1(t) - k_4 C_2(t) \qquad (2)$$

where $P(t)$ is treated as a known (measured) function. To conveniently express the solution of equations (2) we define "macroparameters", $a = (a_1, a_2, a_3, a_4)$, as follows (the plus sign gives $a_4$ and the minus sign gives $a_2$):

$$a_4, a_2 = [k_2 + k_3 + k_4 \pm \{(k_2 + k_3 + k_4)^2 - 4 k_2 k_4\}^{1/2}]/2$$

$$a_1 = k_1 (k_3 + k_4 - a_2)/(a_4 - a_2)$$

$$a_3 = k_1 (a_4 - k_3 - k_4)/(a_4 - a_2). \qquad (3)$$


Then

$$C_1(t) = [k_1/(a_4 - a_2)]\,[(k_4 - a_2)\exp(-a_2 t) + (a_4 - k_4)\exp(-a_4 t)] \otimes P(t)$$

$$C_2(t) = [k_1 k_3/(a_4 - a_2)]\,[\exp(-a_2 t) - \exp(-a_4 t)] \otimes P(t)$$

$$C(t) = C_1(t) + C_2(t) = [a_1 \exp(-a_2 t) + a_3 \exp(-a_4 t)] \otimes P(t) \qquad (4)$$

where $\otimes$ denotes convolution from 0 to $t$. The impulse response expressed earlier as $h(t;k)$ can be defined in terms of $a$ as

$$h(t;a) = a_1 \exp(-a_2 t) + a_3 \exp(-a_4 t) \qquad (5)$$

so that the output function is

$$C(t) = h(t;a) \otimes P(t). \qquad (6)$$

Equation (6) is the mathematical model for the expected value of the PET observations.
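The mapping from rate constants to macroparameters in equations (3), and the definition of LCMRG in equation (1), can be checked numerically. The Python sketch below is an illustration only; the function names and the rate-constant values in the example are assumptions introduced here, not values from this paper. Useful sanity checks follow from equations (3): $a_1 + a_3 = k_1$, $a_2 + a_4 = k_2 + k_3 + k_4$ and $a_2 a_4 = k_2 k_4$.

```python
from math import sqrt


def macroparameters(k1, k2, k3, k4):
    """Macroparameters (a1, a2, a3, a4) of equations (3)."""
    s = k2 + k3 + k4
    root = sqrt(s * s - 4.0 * k2 * k4)
    a2 = (s - root) / 2.0     # slower exponential rate (minus sign)
    a4 = (s + root) / 2.0     # faster exponential rate (plus sign)
    a1 = k1 * (k3 + k4 - a2) / (a4 - a2)
    a3 = k1 * (a4 - k3 - k4) / (a4 - a2)
    return a1, a2, a3, a4


def lcmrg(k1, k2, k3, p0, lc):
    """LCMRG of equation (1): (P0 / LC) * k1 * k3 / (k2 + k3)."""
    return (p0 / lc) * k1 * k3 / (k2 + k3)


if __name__ == "__main__":
    # Hypothetical rate constants, for illustration only.
    k = (0.10, 0.13, 0.06, 0.007)
    a1, a2, a3, a4 = macroparameters(*k)
    print("a =", (a1, a2, a3, a4))
    print("a1 + a3 =", a1 + a3, "should equal k1 =", k[0])
```

Note that equation (1) does not involve $k_4$, so `lcmrg` takes only the first three rate constants together with the plasma glucose concentration and the lumped constant.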

3.  EXISTING  ESTIMATION  PROCEDURES 
3.1  Direct  Method 


For research purposes, usual nonlinear regression methods are used to estimate the rate constants, $k$, or the macroparameters, $a$. If the macroparameters are estimated, inversion of equations (3) yields estimates of the rate constants. Then equation (1) is used to estimate LCMRG.

The PET scan data collection scheme usually consists of ten 2-minute scans followed by ten 5-minute scans followed by ten or eleven 10-minute scans, for a total scan time of approximately three hours. The rapid scans at first are required to record the rapidly changing brain concentration of F-18 immediately after injection. Later, as the brain concentration changes more slowly, longer duration scans are used to compensate for
loss  of  precision  due  to  decay  of  F-18, 
which  has  a  physical  half-life  of 
approximately  two  hours.  The  long  total 
scan  time  is  required  because  a2  is 
usually  on  the  order  of  10"^. 

Measurements  of  P(t)  are  taken  from  a 
peripheral  vessel  beginning  immediately 
after  injection.  P(t)  rises  extremely 
rapidly  to  reach  a  sharp  peak,  usually 
within  the  first  minute,  and  then  falls 
rapidly  at  first  before  beginning  a  more 
gradual  decline  after  about  ten  minutes. 
Samples  are  usually  taken  at  5  to  10 
second  intervals  for  the  first  3  minutes 
and  then  at  progressively  lengthening 
intervals  for  the  remainder  of  the  3  hour 
study.  The  samples  are  counted 
externally  in  a  well  counter  to  determine 
F-18  activity  and  are  calibrated  relative 
to  the  PET  observations. 

The  P(t)  data  are  generally  quite 


noise-free.  However  if  smoothing  is 
required,  a  nonparametric  smoothing 
algorithm  such  as  found  in  Wilson  (1988) 
can  be  used  to  smooth  from  the  peak  to 
the  end  of  the  study.  Samples  of  P(t) 
are  generally  sufficiently  closely-spaced 
so  as  to  allow  a  rather  accurate 
numerical  representation  of  P(t)  by 
simple  linear  interpolation  between 
sampling  times.  Convolution  of  P(t)  with 
exp(-at)  can  then  be  performed  by 
analytically  convolving  the  straight  line 
segments  with  the  exponential  function. 
Convolution  is  required  up  to  each  brain 
sampling  time. 
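The analytic convolution of a linearly interpolated $P(t)$ with $\exp(-at)$ can be sketched as follows. This is a minimal illustration assuming $a > 0$, sampling times that start at zero, and a target time $T$ inside the sampled range; the function name is hypothetical. On a segment $[t_0, t_1]$ where $P(\tau) = p_0 + s(\tau - t_0)$ and $h = t_1 - t_0$, the contribution to the convolution at time $T$ is $p_0 (e_1 - e_0)/a + s\,[h e_1/a - (e_1 - e_0)/a^2]$ with $e_j = \exp(-a(T - t_j))$.

```python
from math import exp


def conv_exp_piecewise_linear(times, values, a, T):
    """Integral over [times[0], T] of P(tau) * exp(-a * (T - tau)) d tau,
    where P is the linear interpolant of (times, values)."""
    if a <= 0:
        raise ValueError("a must be positive")
    if not times[0] <= T <= times[-1]:
        raise ValueError("T must lie within the sampled range")
    total = 0.0
    for i in range(len(times) - 1):
        t0, t1 = times[i], times[i + 1]
        if t0 >= T:
            break
        p0 = values[i]
        slope = (values[i + 1] - p0) / (t1 - t0)
        t1 = min(t1, T)                 # clip the last segment at T
        h = t1 - t0
        e1 = exp(-a * (T - t1))
        e0 = exp(-a * (T - t0))
        # Closed-form integral of the straight-line segment times the exponential.
        total += p0 * (e1 - e0) / a + slope * (h * e1 / a - (e1 - e0) / a**2)
    return total


if __name__ == "__main__":
    # A unit ramp input sampled at integer times, evaluated at T = 2.5.
    times = [0.0, 1.0, 2.0, 3.0]
    print(conv_exp_piecewise_linear(times, times, a=0.5, T=2.5))
```

Because each segment is integrated in closed form, the only approximation is the linear interpolation of $P(t)$ itself. For very small $a$ the difference terms lose precision, so a series expansion would be a safer choice in that regime.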

The  direct  method  is  recognized  to 
provide  accurate  estimates  of  LCMRG, 
conditional  on  the  given  value  of  the 
lumped  constant,  LC,  and  is  the  method  of 
choice  for  research.  For  routine 

clinical  studies,  however,  the  direct 
method  is  impractical  because  of  the 
three-hour  scanning  requirement.  Demand 
for  scanner  time  dictates  shorter 
duration  studies.  Furthermore  the 

difficulties  of  keeping  the  patient's 
head  immobilized  in  the  scanner  for  three 
hours  cannot  generally  be  overcome  in  a 
routine  clinical  setting.  For  these 
reasons  nuclear  medicine  scientists  have 
developed  several  methods  which  require 
only  a  single  PET  scan  of  duration  no 
greater  than  10  minutes.  These  are 
discussed  next. 

3.2  Non-Bayes  Single-Scan  Methods 

The three single scan (SS) methods described in this section all use prior information but are not Bayes procedures.
They  are  distinguished  from  the  Bayes  SS 
method  which  we  developed,  and  which  is 
described  in  the  next  section.  All  four 
SS  methods  require  only  a  single  PET 
scan,  of  duration  usually  10  minutes  and 
centered  at  time  t  =  T,  which  is  usually 
40  to  60  minutes  post-injection.  All  of 
the  SS  methods  make  use  of  estimates  of 
the  mean  values  of  the  rate  constants  or 
macroparameters  in  the  normal  population. 
These  estimates  are  available  from 
studies  employing  the  direct  method. 
(See Huang and Phelps et al, 1980.)

Let k = (k1, k2, k3, k4) denote the
estimates of the normal population mean
rate constants. Let LCMRG(k) be LCMRG of
equation (1), evaluated at k. Let
C(T;k), C3(T;k), and C2(T;k) be C(t),
C3(t), and C2(t) of equations (4)
evaluated at t = T with the rate constants
set to k, and with use of P(t) from the
subject under measurement.

Let  y(T)  be  the  PET  scan  measurement 
of  the  subject  at  t  =  T:  y(T)  =  C(T)  + 
noise.  The  first  non-Bayes  SS  method  for 
estimating  LCMRG  is  due  to  Sokoloff, 
Phelps,  and  Huang,  and  the  estimator  is 
denoted  herein  as  LCMRG(SPH): 

LCMRG(SPH) =

LCMRG(k) [y(T) - C3(T;k)] / C2(T;k).  (7)

Note that if y(T) were replaced with
C(T;k) in equation (7), LCMRG(SPH) would
simply be LCMRG(k).

All  of  the  foregoing  development  in 
sections  1,  2,  and  3  can  be  found  in 
greater  detail  for  the  biomedical  reader 
in  Phelps  and  Huang  et  al  (1979),  and 
Huang  and  Phelps  et  al  (1980). 

Let a = (a1, a2, a3, a4) denote the
estimates of the normal population mean
macroparameters (obtained from the same
source as k). The second non-Bayes SS
method is due to Brooks (1982), and the
estimator is denoted herein as LCMRG(B):

LCMRG(B) = LCMRG(k) [y(T) - a3 exp(-a4 T) ⊗ P(T)]
           / [a1 exp(-a2 T) ⊗ P(T)]  (8)

where P(t) is from the patient under
measurement. Note, again, that if y(T)
were replaced with C(T;a), LCMRG(B) would
simply be LCMRG(k).

The  third  SS  method  for  estimating 
LCMRG  is  due  to  Hutchins  and  Holden  et  al 
(1984),  and  the  estimator  is 

LCMRG(H)  = 

LCMRG(k)y(T)/[C3(T;k)  +C2(T;k)].  (9) 

As before, if y(T) were replaced with
C(T;k), then LCMRG(H) = LCMRG(k).
Hutchins and Holden et al point out that
this estimator is independent of k1 since
LCMRG(k), C3(T;k), and C2(T;k) are all
linear in k1, and the term cancels in
their estimator. They argue that this
should be of value in studying ischemic
tissue of stroke patients since k1 will
be diminished in such tissue.
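The three non-Bayes estimators of equations (7), (8), and (9) differ only in how the measurement y(T) is combined with model predictions computed from population-mean parameters. The sketch below assumes those predictions (the two compartment curves, or the two convolved exponential terms, evaluated at T) have already been computed elsewhere. All function and argument names are hypothetical.

```python
def lcmrg_sph(lcmrg_k, y, c3, c2):
    """Equation (7): subtract the predicted C3(T;k), scale by C2(T;k)."""
    return lcmrg_k * (y - c3) / c2


def lcmrg_brooks(lcmrg_k, y, term_a3a4, term_a1a2):
    """Equation (8): term_a3a4 = a3*exp(-a4*t) convolved with P at T,
    term_a1a2 = a1*exp(-a2*t) convolved with P at T."""
    return lcmrg_k * (y - term_a3a4) / term_a1a2


def lcmrg_hutchins(lcmrg_k, y, c3, c2):
    """Equation (9): scale by the ratio of measured to predicted total."""
    return lcmrg_k * y / (c3 + c2)
```

When y(T) equals the predicted total, each function returns `lcmrg_k` unchanged, which is the consistency property noted after each equation.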

For  our  purposes  it  is  important  to 
point  out  that  Hawkins,  Phelps,  Huang, 
and  Kuhl  (1981)  studied  the  behavior  of 
LCMRG(SPH) in normal tissue and the
ischemic tissue of stroke patients. They
found that while LCMRG(SPH) behaved
reasonably  well  in  normal  tissue,  it  had 
a  negative  bias  of  about  50%  in  ischemic 
tissue.  This  finding  was  confirmed  in 
simulation  studies  by  Wilson,  Huang,  and 
Links  (1984),  who  also  studied  LCMRG(B) 
by  simulation,  and  found  it  to  have  a 
negative  bias  of  about  35  to  40%  in 
ischemic  tissue.  Those  authors  also 
showed,  by  simulation,  that  if  k  or  a 
from  an  ischemic  tissue  population  is 
used by LCMRG(SPH) and LCMRG(B) when
studying  ischemic  tissue,  these  SS 
estimators  behave  quite  well. 

The non-Bayes SS methods must use
prior  information,  k  or  a,  from  a 
specified  population.  Although  a  portion 
of  the  brain  of  a  stroke  patient  is 
ischemic,  perfusion  in  most  of  the  brain 
is  normal.  In  studying  such  a  brain  with 
a  non-Bayes  SS  method  there  are  two 
choices:  (1)  use  prior  information  from 
the  normal  population,  or  (2)  perform 
preliminary  perfusion  scans  to  determine 
the perfusion status of the various
regions and use prior information from
the  normal  or  ischemic  tissue  population 
in  studying  the  LCMRG  of  a  given  region, 
according  to  the  perfusion  status  found 
for  that  region.  We  defined  the  non- 
Bayes  SS  methods  as  using  prior 
information  from  the  normal  population 
because  that  is  the  way  they  are  usually 
used  clinically.  The  preliminary 
perfusion  scan  is  usually  avoided  because 
of  the  additional  scanner  time,  the 
additional  radiation  dose  to  the  patient, 
the  additional  effort  required  in 
evaluating  the  scan,  and  the  delay  this 
causes  in  estimating  the  LCMRG. 

We  have  developed  a  Bayes  procedure 
which  can  be  used  as  a  SS  procedure,  and 
which  can  choose  between  estimates  of 
LCMRG  based  on  prior  information  from  two 
sources  (normal  and  ischemic  tissue 
populations  in  our  case)  using  a  highest 
posterior  density  criterion  unavailable 
to  the  non-Bayes  methods.  The  Bayes 
procedure  is  described  next. 

4.  BAYES  ESTIMATION

Although, formally, the Bayes
estimates of a should be converted to
rate constant estimates (using the
inverse of equations (3)), and these
estimates then used in equation (1), we
found empirically that this procedure has
some negative bias which can be partly
eliminated by the estimator

LCMRG(Bayes) = (Pg/LC) â1  (10)

where â1 is the Bayes estimator of
macroparameter a1. The justification for this
choice is as follows: Brooks (1984)
pointed out that a1 = βR, where R =
k1k3/(k2 + k3), and β→1 as k4→0. In the
data base used to provide estimates k and
a for prior information (Huang and Phelps
et al), we found that β = 1.05 with very
little variation among individuals. Thus
using â1 as the estimator of R
compensates for some of the negative bias
which would otherwise occur.

Let the set of mid-scan times be t_i,
i = 1,...,n. Although we emphasize the use
of Bayes estimation as a SS procedure, we
describe the general procedure. Re-express
C(t_i) of equation (4) as C(t_i;a)
and shorten it to C_i(a). Let y_i be the
PET scan observation at time t_i so that
E(y_i) = C_i(a). Let y = (y_1, ..., y_n)'
(where prime denotes transposition), and
define a = (a1, a2, a3, a4)' to be the
column vector of the macroparameters.
Let θ = θ(a,y) denote the true sum of
squared errors: θ = Σ_{i=1}^{n} [y_i - C_i(a)]². Let
the variance of y_i be v, assumed here to
be constant over i. Let τ = 1/v be the
"precision". (It is more convenient to
use a prior density for τ than for v.)
The density of the data is assumed to be
Gaussian:

$$f_y(y \mid a, \tau) \propto \tau^{n/2} \exp(-\tau\theta/2). \tag{11}$$
The prior density of a is assumed to be
Gaussian with mean vector a_0, covariance
matrix Ω, and independent of τ:

$$f_a(a \mid a_0, \Omega) \propto |\Omega|^{-1/2} \exp\{-(a - a_0)'\,\Omega^{-1}(a - a_0)/2\}. \tag{12}$$

The prior density of τ is assumed to be
Gamma with parameters δ and u so that
E(τ) = δ/u and var(τ) = δ/u²:

$$f_\tau(\tau \mid \delta, u) \propto \tau^{\delta-1} \exp(-u\tau), \quad \text{for } \tau, u, \delta > 0. \tag{13}$$

(See pp 3-5 of Broemeling, 1985, and pp
77-82 of Vinod and Ullah, 1981. In the
treatment of those authors, Ω in equation
(12) is replaced with Ω/τ. This is of no
use here because our empirically
estimated prior moments are a_0 and Ω,
whereas τ is unknown.)

The joint posterior density of a and τ
is proportional to the product of the
right hand sides of equations (11), (12),
and (13). After integrating τ out of
this joint posterior density, one obtains
the marginal posterior density of a:

$$p(a \mid y, a_0, \Omega, \delta, u) \propto f_a(a \mid a_0, \Omega)\,(2u + \theta)^{-(\delta + n/2)}. \tag{14}$$

We defined our Bayes estimator of a in
terms of maximum posterior density (MPD)
estimation for computational simplicity
rather than using minimum-Bayes-risk
estimation (Bard, 1974, pp 61-75). To
maximize the posterior density in
equation (14), one must solve, for k =
1,2,3,4,

$$\sum_{i=1}^{n} C_{ki}(a)\,[y_i - C_i(a)] - \frac{\theta + 2u}{n + 2\delta}\,\Bigl[\sum_{j=1}^{4} (a_j - a_{0j})\,U_{kj} + (a_k - a_{0k})\,U_{kk}\Bigr]/2 = 0 \tag{15}$$

where C_ki(a) = ∂C_i(a)/∂a_k, U_ij is defined
by Ω^{-1} = U = (U_ij), and a_0k, k = 1,...,4,
are the elements of a_0.

The solution of equations (15) is
obtained twice: once using the prior
moments a_0 and Ω from the normal tissue
population and once using these prior
moments from the ischemic tissue population.
The solution producing the highest
posterior density in equation (14) is
chosen as the Bayes estimate of a.
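The selection step can be illustrated by evaluating the logarithm of the marginal posterior density (14) at each candidate solution and keeping the larger value. The sketch below is an assumption-laden illustration rather than the authors' program. The names `log_posterior` and `choose_prior`, and the user-supplied `model` callable returning the predicted C_i(a), are hypothetical. The MPD solutions themselves are assumed to have been found separately, and the same δ and u are assumed for both priors. The |Ω|^{-1/2} factor of equation (12) is retained because it differs between the two priors.

```python
import math


def cholesky(A):
    """Lower-triangular Cholesky factor of a positive-definite matrix."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L


def log_posterior(a, y, model, a0, omega, delta, u):
    """Log of equation (14), up to a constant common to both priors."""
    theta = sum((yi - ci) ** 2 for yi, ci in zip(y, model(a)))
    L = cholesky(omega)
    r = [ai - a0i for ai, a0i in zip(a, a0)]
    z = []  # forward substitution: solve L z = a - a0
    for i in range(len(r)):
        z.append((r[i] - sum(L[i][k] * z[k] for k in range(i))) / L[i][i])
    quad = sum(zi * zi for zi in z)  # (a - a0)' Omega^{-1} (a - a0)
    log_det = 2.0 * sum(math.log(L[i][i]) for i in range(len(r)))
    n = len(y)
    return (-0.5 * log_det - 0.5 * quad
            - (delta + n / 2.0) * math.log(2.0 * u + theta))


def choose_prior(candidates, y, model, delta, u):
    """candidates maps a label to (a_hat, a0, omega); return the best label."""
    scores = {name: log_posterior(a_hat, y, model, a0, om, delta, u)
              for name, (a_hat, a0, om) in candidates.items()}
    return max(scores, key=scores.get), scores
```

In a single-scan setting `y` has one element, so the comparison is driven largely by how far each solution has to move from its prior mean to reproduce the measurement.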

5.  COMPUTER  SIMULATION  STUDIES 
COMPARING  BAYES  AND  NON-BAYES  SS 
METHODS  IN  ISCHEMIC  TISSUE 

Small data bases of rate constants
estimated by the direct method have been
published by Huang and Phelps et al
(1980) for normal tissue, and by Hawkins,
Phelps, Huang, and Kuhl (1981) for
ischemic tissue. We consider only gray
matter tissue here. While these data
bases are too small to provide prior
moments for empirical Bayes analyses of
actual human data, they can nevertheless
form the bases for "simulation
populations" for comparing the behavior
of the four SS estimators described
above.

To  allow  for  some  modeling  error  in 
the  prior  distribution  of  a,  and  at  the 
same  time  rule  out  any  negative  elements 
of  a  in  the  simulation  population,  we 
assumed  the  elements  of  a  to  be 
distributed  joint  lognormal  in  the 
simulation  population.  Let  L  and  S  be 
the  mean  and  covariance  matrix, 
respectively,  of  the  logs  of  the  elements 
of  a  in  the  simulation  population.  These 
moments were taken to be the values
computed  from  the  logs  of  the  elements  of 
the  a-estimates  in  the  data  base.  To 
obtain  the  a  for  a  simulated  subject,  we 
generated  a  pseudo-random  realization  of 
a  four-variate  Gaussian  random  vector 
with  moments  L  and  S,  and  then 
exponentiated  the  elements.  The 
macroparameters  for  all  simulated 
subjects  were  generated  from  the  ischemic 
simulation  population. 
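The lognormal sampling step described above (draw a four-variate Gaussian vector with moments L and S, then exponentiate) can be sketched as follows. The function names are hypothetical, and the numerical values of L and S would come from the data base, which is not reproduced here.

```python
import math
import random


def cholesky(S):
    """Lower-triangular Cholesky factor of a positive-definite matrix."""
    n = len(S)
    C = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = S[i][j] - sum(C[i][k] * C[j][k] for k in range(j))
            C[i][j] = math.sqrt(s) if i == j else s / C[j][j]
    return C


def sample_macroparameters(L_mean, S_cov, rng):
    """One joint-lognormal draw: exp of a Gaussian vector with mean L, cov S."""
    chol = cholesky(S_cov)
    z = [rng.gauss(0.0, 1.0) for _ in L_mean]
    logs = [mu + sum(chol[i][k] * z[k] for k in range(i + 1))
            for i, mu in enumerate(L_mean)]
    return [math.exp(v) for v in logs]
```

Exponentiation guarantees that every simulated macroparameter is positive, which is the stated reason for choosing the lognormal form.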

After  generating  the  macroparameters, 
a,  for  a  simulated  subject,  the  impulse 
response,  h(t;a)  in  equation  (5),  was 
generated.  A  plasma  curve,  P(t),  for  the 
individual  was  then  generated  as  a 
combination  of  5  exponentials  with 
coefficients  randomly  selected  from 
ranges  seen  in  practice,  and  constrained 
so that C(t) = h(t;a) ⊗ P(t) rises over the
first hour to match clinical experience.
The single scan time chosen was T = 1 hour.
The PET measurement, y(T), was then created as
y(T) = C(T) + e, where e was a pseudo-random
realization of a zero-mean
Gaussian  variate  with  standard  deviation 
0.05  C(T).  The  multiple  0.05  was  chosen 
because  in  the  ischemic  data  base,  in 
which  every  a-estimate  was  accompanied  by 
its  associated  mean-squared-error  (MSE) 
of  fit,  the  average  root  MSE  was 
approximately  5%  of  the  fitted  value  of 
C(T). The data, {y(T), P(t)}, were then
analyzed  by  each  of  the  four  SS  methods. 

The  factor  Pq/LC  was  not  used  in 
estimating  LCMRG  because  all  results  were 
recorded  as  percent  error,  and  Pq/LC  is  a 
common  multiplier  in  both  the  true  LCMRG 
and  all  four  estimators. 

Prior moments (a_0, Ω) were available
from  both  the  normal  and  the  ischemic 
simulation  populations  for  use  in  MPD 
Bayes  estimation.  The  prior  mean  from 
the  normal  population  was  used  in  the 
three  non-Bayes  SS  methods.  The  values 
of u and δ used in the Bayes estimation
were obtained as follows: Letting m and
d be, respectively, the mean and variance
of the reciprocal MSE values in the data
base, we solved the equations m = δ/u and
d = δ/u² for u and δ.
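Solving the two moment equations gives u = m/d and δ = m²/d. A small sketch follows. The function names are hypothetical, and the use of the sample (n-1) variance is an assumption, since the text does not say which variance estimate was used.

```python
import statistics


def gamma_hyperparameters(m, d):
    """Solve m = delta/u and d = delta/u**2 for (u, delta)."""
    u = m / d
    delta = m * m / d
    return u, delta


def precision_prior_from_mse(mse_values):
    """Moment-match the Gamma prior to the reciprocal MSE values."""
    recips = [1.0 / v for v in mse_values]
    m = statistics.mean(recips)
    d = statistics.variance(recips)  # assumption: sample variance
    return gamma_hyperparameters(m, d)
```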

The  simulation  studies  were  designed 
to  show  certain  characteristics  of  the 
distribution  of  percent  errors  of  the 
four  estimators  as  a  function  of  the  true 
value of R = k1k3/(k2+k3). We simulated
100 sets of data in each of 8 intervals
in R from 0.01 to 0.03 with interval
width 0.0025. In generating the data for
a particular interval in R, macroparameters
with associated value of R not
in the interval were discarded, and
generation continued until 100 sets of
macroparameters with R in the specified
interval were obtained.

Results  of  the  simulation  studies  are 
shown  in  Figures  2,  3,  and  4.  Figure  2 
shows the root-mean-square of the distribution
of percent errors in the 100
analyses  by  each  of  the  four  estimators 
in  each  of  the  8  intervals  in  R.  Values 
of  R  in  the  last  two  intervals  on  the 
right  are  in  the  low  normal  range  (even 
though  all  macroparameters  were  generated 
from  the  ischemic  simulation  population). 
Because  the  Bayes  and  Hutchins-Holden 
procedures  are  distinctly  superior  to  the 
other two estimators, Figures 3 and 4
display  behavior  of  only  these  two 
superior  estimators. 

Figure 3 shows the mean of the distribution
of percent errors of the same 100
analyses  by  the  Bayes  and  Hutchins-Holden 
methods  in  each  interval  of  R.  This 
figure  shows  that  most  of  the  inferior 
behavior  of  the  Hutchins-Holden  estimator 
is  due  to  a  negative  bias  of  about  12%  on 
average. 

Figure  4  shows  the  range  in  the 
distribution  of  percent  errors.  In  the 
range  0.0125  <  R  <  0.0275,  the  largest 
absolute  percent  errors  were  smallest  for 
the  Bayes  procedure. 

These  results  indicate,  as  expected, 
that the Bayes SS procedure should outperform
the three non-Bayes SS procedures
in  analysis  of  actual  human  data,  once  a 
sufficiently  large  data  base  becomes 
available  so  that  it  can  serve  as  the 
basis  for  empirical  prior  moments.  It  is 
a  tribute  to  the  Hutchins-Holden 
procedure  that  it  performs  as  well  as  it 
does.  Because  it  is  computationally  much 
less  burdensome  than  the  Bayes  procedure, 
LCMRG(H)  presents  a  challenge  to 
statisticians  to  develop  a  simpler 
procedure  which  can  outperform  it. 

A  report  of  this  work  for  biomedical 
readers  can  be  found  in  Wilson,  Huang, 
and  Hawkins  (1988). 
ESTIMATION  OF  DEATH  DENSITY  USING  GROUPED
CENSUS  AND  VITAL  STATISTICS  DATA

John J. Hsieh, University of Toronto


1.  INTRODUCTION 

This article develops a precise method for
describing the distribution of the lifetime using
census population and vital statistics data in
cross-sectional studies. The description of the lifetime
distribution will be made through estimation of the
survival function p(x), conditional mortality
probability function q_x = 1 - p(x+1)/p(x), death density
function f(x), hazard function h(x) = f(x)/p(x) and life
expectancy (mean residual life function) e(x) = ∫_x^∞ p(y)dy/p(x).

To reduce random as well as systematic errors in
the raw data employed in the estimation, population
and death data are grouped according to the
convention of abridged life tables in which the
agespan (0,∞) is partitioned into n = 19 intervals
(x_i, x_{i+1}], i = 0, 1, ..., 18, with division points at x_1 = 1,
x_2 = 5, x_3 = 10, ..., x_17 = 80 and x_18 = 85, so that the
lengths of the age intervals are all five years
except for the first two intervals which are one and
four years, respectively, and for the last interval
which is infinite. (For countries with high quality
data the last division point may be taken to be
x_19 = 90 so that there are n = 20 age intervals all told.) To
estimate the functions for the lifetime distribution
in a cross-sectional study we have to choose a base
period and assume that the observed mortality
schedule in the base period remains unchanged over
time. To increase the reliability of the estimates,
we shall use a base period of three calendar years.

Accuracy  of  estimation  is  achieved  in  each  step 
by  appropriate  use  of  mathematical  approximations, 
numerical  quadrature  and,  in  particular,  spline 
interpolation,  differentiation  and  integration.  The 
available  mortality  data  for  the  subdivisions  of  the 
first year of life and properties of the lifetime
distribution allow an accurate determination of the
two end conditions for the spline function. The use
of  spline  methods  serves  to  further  smooth  out  errors 
arising  from  incomplete  reporting  and  other  sources 
that  still  remain  in  the  data  after  age  grouping. 

The present method does more than just provide a
new way of calculating the conventional life table
functions (as in the construction of abridged life
tables); it has several important features: (1) it
allows one to calculate fundamental and useful
functions of the lifetime distribution such as the
death density function and the hazard function that
are not available in a conventional life table, (2)
it provides more accurate estimation of the
conventional life table functions as well as other
functions not found in published life tables than
existing life table methods do, and, most
importantly, (3) it allows one to compute these
functions at any age point and for any age interval,
in contrast to the traditional life table method
which gives life table functions only at the age
division points and for age intervals of a fixed mesh
corresponding to the age grouping of the population
and death data. (See Reed and Merrell 1939, Chiang 1968,
Keyfitz and Frauenthal 1975.) Thus the present method not
only allows the construction of complete as well as more
refined life tables from abridged life tables but
also expands the utility of the life table so
constructed.

Section 2 describes the requisite data and
calculation of death rates which are used in
estimation of the lifetime distribution in Section 3.
Methods for estimating the various functions
depicting the lifetime distribution are described in
Sections 3.1 through 3.6. Finally, an example
illustrating the estimation of the five functions
f(x), h(x), q_x, p(x) and e(x) using Canadian data
is given in Section 4.


2.  THE  DATA  AND  CALCULATION
OF  DEATH  RATES


The following data are required for the
estimation of the above lifetime distribution
functions according to the format of conventional
abridged life tables using the proposed procedures:
(1) annual number of births B_j for each calendar
year j of the 3-year base period plus one year preceding
the base period, (2) deaths d_j in the last month of the
first year of life for each calendar year j in the base
period, (3) mid-year populations P_ij and deaths D_ij for the
ith age group (i = 0, 1, ..., 18) and jth calendar year
(j = 1, 2, 3), i.e., for the first year of life (i = 0)
and by 5-year age groups up to age 85 as well as
those aged 85+, for each of the three calendar years
in the base period, plus infant deaths for the year
preceding the base period (D_00). These data are
available from annual reports of vital statistics and
mid-year population estimates or censuses published
by most member nations of the United Nations and are
also recorded in U.N. Demographic Yearbooks.

In order to estimate the lifetime distribution
by the proposed procedures one has first to calculate
the age-specific death rates M_i from the death and
population data {D_ij, P_ij}. The death rate M_i for the ith
age group in the base period is defined as the total
deaths in the ith age group over the base period A
divided by the total observed person-years of exposure
during the base period A for that age group. In
symbols,

$$M_i = \frac{D_i}{\int_A P_i(t)\,dt}, \tag{1}$$

where D_i = Σ_{j=1}^{3} D_ij is the number of deaths observed during
the 3-year base period for the ith age group and
P_i(t) is the population at time t for the ith age
group. From formula (1) it is seen that the problem
of calculating the death rate M_i reduces to the
problem of numerically integrating out the person-year
integral in the denominator of (1) in terms of
the available mid-year population data. To this end,
we take P_i(t) to be a collocation polynomial (such
as represented by Newton's forward formula) of order
three interpolating to the three prescribed
population data in the base period and perform the
indicated integration to obtain:
$$M_i = D_i / Y_i, \tag{2}$$

where

$$Y_i = 3(3P_{i1} + 2P_{i2} + 3P_{i3})/8 \tag{3}$$

is the estimated person-years of exposure for the ith
age group using the numerical quadrature described
above.


Other  methods  of  numerical  quadrature  could  be 
used  but  the  results  would  only  differ  insignificantly 
from  that  given  by  (3)  (see  Hsieh,  1978).  In  fact 
formula  (3)  is  already  considerably  more  accurate 
than  the  traditional  method  of  calculating  the  death 
rate  by  dividing  the  average  number  of  deaths  in  the 
base period by the mid-period population. The errors
that  still  remain  in  the  population  data  will  be 
further  smoothed  by  spline  methods. 
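The quadrature weights in formula (3) can be checked numerically: integrating the quadratic through the three mid-year populations (at 0.5, 1.5 and 2.5 years) over the 3-year base period gives weights 9/8, 6/8 and 9/8, which sum to the period length. The sketch below uses hypothetical function names.

```python
def person_years(p1, p2, p3):
    """Formula (3): person-years over a 3-year base period from the
    three mid-year populations of one age group."""
    return 3.0 * (3.0 * p1 + 2.0 * p2 + 3.0 * p3) / 8.0


def death_rate(deaths_by_year, p1, p2, p3):
    """Formula (2): total deaths divided by estimated person-years."""
    return sum(deaths_by_year) / person_years(p1, p2, p3)
```

For a constant population the rule returns three times that population, and it is exact for any population that varies quadratically over the period.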


3.  ESTIMATION  OF  THE  LIFETIME 
DISTRIBUTION 

In  this  section  we  shall  show  how  the  data  and 
death  rates  described  in  section  2  are  used  to 
estimate various point and set functions representing
the distribution of the lifetime. The procedures for
the first year of life (i = 0) and the last age
interval (i = 18) differ from those for the remaining
age  intervals.  These  two  extreme  age  intervals 
require  special  treatments  because  mortality  is 
extremely  high  and  declines  very  sharply  during  the 
first  year  of  life  and  because  in  the  last  open-ended 
age  interval,  the  data  tend  to  be  thin  and  not  of 
good  quality. 

The method starts with the estimation of the
conditional mortality probability q_i for the age
intervals of the given mesh described in Section 1,
using mathematical and numerical methods of approximation.
This is followed by direct estimation and
spline interpolation of the survival function at any
age for the age segment (1,85). Estimates of the
death density and hazard functions are obtained by
spline differentiation and life expectancy by spline
integration of the survival function. For ages beyond
85, we employ the Gompertz law of mortality to derive
the estimates of these functions. Once the survival
function at any age point is determined from the spline
(for x ≤ 85) or from the Gompertz curve (for x > 85), it
becomes a trivial matter to compute the conditional
mortality probability and mean residence time for any
age intervals.


3.1  Estimation of Conditional
Mortality Probability q_i

The conditional mortality probability q_i is the
probability of death in (x_i, x_{i+1}) given survival to age x_i,
i = 0, 1, ..., 18. They are estimated as follows:

1.  For the first age interval (i = 0)

$$q_0 = 1 - (1-a)(1-b), \tag{4}$$

where a and b are given by (5), f is the separation factor,
normally taken to be .1, D_0 = Σ_{j=1}^{3} D_0j is the infant
deaths in the base period, D_0' = Σ_{j=0}^{2} D_0j is the infant
deaths in the 3 calendar years starting from the year
preceding the base period, B = Σ_{j=1}^{3} B_j is the total
births in the base period and B' = Σ_{j=0}^{2} B_j is the total
births in the 3 calendar years starting from the year preceding
the base period. Formulas (4) and (5) are derived
from probability arguments. Alternatively, q_0 may be
obtained from construction of infant life tables (see
Hsieh, 1985).

2.  For the central age intervals (i = 1, 2, ..., 17)

$$\ln(1 - q_i) = -h_i\,(M_i + A_i B_i / Y_i), \tag{6}$$

where h_i = x_{i+1} - x_i, M_i and Y_i are given by equations
(2) and (3), respectively, and

(a). for i = 1,

A,  -  (725Y,-418Y2-I62Y3)/12825. 

B,  -  (-ll20M,»1444M2-324M3)/855:  .  (7) 

(b). for i = 2, ..., 15,

A,  -  fay,. r3Yr 51^1  *2^192. 

B,  -  (-3M,.,-3M,»7M,»,-Mi*2)/8j  .  (8) 

(c). for i = 16 and 17,

Aj  -  (Y,.2*2Y,.,-3Y,X/48. 

B,  -  (M,.2-4M,.,OM,)/2i  (9) 


The derivation of formulas (6) - (9), which employs
solution of an integral equation using Taylor
expansion and Newton's formulas, is given in Hsieh
(1988).

3.  For the last age interval (i = 18)

q_18 = 1, since everybody has to die eventually.

Estimation of conditional mortality probability for any
age intervals other than (x_i, x_{i+1}), i = 0, 1, ..., 18, is given in
Section 3.6.


3.2  Estimation of Survival Function

Unlike the conditional mortality probability
q_i discussed in Section 3.1 and the conditional
mean residence time and mortality probability to be
discussed in Section 3.6, which are set functions, the
remaining four functions to be studied in Sections
3.2-3.5 are all point functions.

The survival function p(x) is the probability of
surviving to age x. The survival function p(x_i) at
the division points x_i is obtained directly from
the conditional mortality probabilities q_i as follows:
For i = 1, 2, ..., 18,


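$$p_i = p(x_i) = \prod_{j=0}^{i-1} (1 - q_j). \tag{10}$$

The quantities a and b in equation (4) are

$$a = (1-f)D_0/B, \qquad b = fD_0/[B' - (1-f)D_0']. \tag{5}$$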
The proof of (10) is straightforward. Clearly, at
the two ends of the agespan, p(0) = 1 and p(∞) = 0.
To estimate the survival function at ages other
than the division points, i.e., at x ≠ x_i, i = 1, 2,
..., 18, different procedures are required for each
of the three main age segments. For ages under one
year, the estimation method is given in Hsieh (1985).
For ages from 1 to 85 years the method of spline
interpolation will be used. To this end, we pass an
interpolating spline s_p(x) through the prescribed
p(x_i) values, i = 1, 2, ..., 18, and take this spline function as
an estimate of the survival function p(x), for all x ∈ [1,85).
From our knowledge of the pattern of the survival
curve for the human lifetime and the availability of
the mortality data in cross-sectional studies, the
complete interpolating cubic spline would be the best choice
among all polynomial spline functions possessing
optimum approximation properties.

An interpolating cubic spline can be represented
mathematically in a number of ways (see, e.g.,
Ahlberg, Nilson and Walsh 1967, de Boor 1978 and
Schumaker 1981). We shall choose the following
representation for the cubic spline interpolating to
the prescribed data p_i ≡ p(x_i), i = 1, 2, ..., 18: For x ∈ (x_i, x_{i+1}),

$$s_p(x) = m_i \frac{(x_{i+1}-x)^2 (x-x_i)}{h_i^2} - m_{i+1} \frac{(x-x_i)^2 (x_{i+1}-x)}{h_i^2} + p_i \frac{(x_{i+1}-x)^2 [2(x-x_i)+h_i]}{h_i^3} + p_{i+1} \frac{(x-x_i)^2 [2(x_{i+1}-x)+h_i]}{h_i^3}. \tag{11}$$


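For a given x in (x_i, x_{i+1}), i = 1, 2, ..., 17, all quantities in (11)
are known except the slopes m_i = s_p'(x_i) at the division
points. These parameters are determined by solving the
following system of 16 linear equations:

$$h_i m_{i-1} + 2(h_i + h_{i-1})\, m_i + h_{i-1} m_{i+1} = 3\left[\frac{h_{i-1}}{h_i}(p_{i+1} - p_i) + \frac{h_i}{h_{i-1}}(p_i - p_{i-1})\right] \tag{12}$$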


When x = x_i or x = x_{i+1}, equation (11) reduces, as expected, to
s_p(x_i) = p_i or s_p(x_{i+1}) = p_{i+1}, respectively.

For the last open-ended age interval (i.e. for
ages beyond x_18 = 85), we discard the data in
this age group for lack of reliability and adopt the
Gompertz law of mortality. By fitting the Gompertz
survival curve to the three prescribed survival
functions at ages x_16 = 75, x_17 = 80, and x_18 = 85, namely, p_16,
p_17 and p_18, we obtain, for t = x - x_18,

$$p(x) = p_{18}\, g^{c^t - 1}, \tag{15}$$

where c and g are the constants of the fitted curve,
and p_16, p_17 and p_18 are given by (10).
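One way to obtain the two Gompertz constants from the three survival values is sketched below. This is a derivation from form (15) alone and is an assumption about how the fit could be carried out, not a transcription of the paper's own formulas for c and g. Writing (15) at t = -5 and t = -10 gives ln(p_17/p_18) = (c^{-5} - 1) ln g and ln(p_16/p_17) = c^{-5}(c^{-5} - 1) ln g, so the ratio of the two logarithms determines c. Function names are hypothetical.

```python
import math


def fit_gompertz(p16, p17, p18, h=5.0):
    """Constants (c, g) of p(x) = p18 * g**(c**t - 1), t = x - x18,
    passing through p17 at t = -h and p16 at t = -2h."""
    r_near = math.log(p17 / p18)  # (c**-h - 1) * ln g
    r_far = math.log(p16 / p17)   # c**-h * (c**-h - 1) * ln g
    c = (r_near / r_far) ** (1.0 / h)
    ln_g = r_near / (c ** (-h) - 1.0)
    return c, math.exp(ln_g)


def gompertz_survival(x, p18, c, g, x18=85.0):
    """Form (15) evaluated at age x."""
    return p18 * g ** (c ** (x - x18) - 1.0)
```

The fit requires p_16 > p_17 > p_18 > 0 and a hazard that increases with age (c > 1), which is the usual situation at these ages.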


3.3  Estimation of Death Density Function

The death density function f(x) is the
probability per unit time of dying in the instant
immediately following age x. For the age segment
(1,85), the death density function at a division
point x_i, i = 1, 2, ..., 18, is simply the negative of the slope of
the survival function at that point. The spline
estimate of the death density at ages other than the
division points is obtained by differentiating the
spline estimate of the survival function (11) to
yield:

$$s_f(x) = -s_p'(x) = -h_i^{-3}\{h_i (x_{i+1}-x)(2x_i + x_{i+1} - 3x)\, m_i - h_i (x - x_i)(2x_{i+1} + x_i - 3x)\, m_{i+1} + 6 (x_{i+1}-x)(x-x_i)(p_{i+1} - p_i)\}. \tag{18}$$


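The system (12) holds for i = 2, 3, ..., 17 and is solved with the two
boundary conditions:

(a) the first end slope

$$m_1 = -(365/31)\, p_1\, d / (B - D_0 + d), \qquad d = \sum_{j=1}^{3} d_j, \tag{13}$$

(b) the last end slope

$$m_{18} = -p_{18}\, M_{17}^{3/2} / M_{16}^{1/2}. \tag{14}$$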


Equation (12) is derived from equation (11) by
differentiating twice with respect to x and using the
continuity constraints of cubic splines at the
interior division points. The two end (boundary)
conditions (13) and (14), which estimate the slopes
of the survival function at the two boundaries, are
accurately determined from properties of the lifetime
distribution. The tridiagonal form of the coefficient
matrix of (12) allows the linear system to be
easily solved using a computer by Gaussian
elimination with partial pivoting. Furthermore, the
diagonal dominance and symmetric characteristic of
the matrix guarantee stable results with minimum
accumulation of rounding errors. Once the parameters
m_i are solved for from (12) with boundary conditions
(13) and (14), the survival function at any age x can
be calculated from (11).
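The slope system and the spline evaluation can be sketched as below. This illustration assumes the standard complete-cubic-spline continuity equations in slope form with prescribed end slopes. It solves the tridiagonal system by plain forward elimination without pivoting, which is adequate here because the matrix is diagonally dominant, whereas the text mentions partial pivoting. Function names are hypothetical.

```python
from bisect import bisect_right


def spline_slopes(x, p, m_first, m_last):
    """Slopes m_i of the complete cubic spline through (x_i, p_i)
    with prescribed end slopes."""
    n = len(x)
    h = [x[i + 1] - x[i] for i in range(n - 1)]
    m = [0.0] * n
    m[0], m[-1] = m_first, m_last
    if n == 2:
        return m
    lower, diag, upper, rhs = [], [], [], []
    for i in range(1, n - 1):
        lower.append(h[i])
        diag.append(2.0 * (h[i] + h[i - 1]))
        upper.append(h[i - 1])
        rhs.append(3.0 * (h[i - 1] * (p[i + 1] - p[i]) / h[i]
                          + h[i] * (p[i] - p[i - 1]) / h[i - 1]))
    rhs[0] -= lower[0] * m_first   # move known end slopes to the right side
    rhs[-1] -= upper[-1] * m_last
    for k in range(1, len(diag)):  # forward elimination
        w = lower[k] / diag[k - 1]
        diag[k] -= w * upper[k - 1]
        rhs[k] -= w * rhs[k - 1]
    sol = [0.0] * len(diag)
    sol[-1] = rhs[-1] / diag[-1]
    for k in range(len(diag) - 2, -1, -1):  # back substitution
        sol[k] = (rhs[k] - upper[k] * sol[k + 1]) / diag[k]
    m[1:-1] = sol
    return m


def spline_functions(x, p, m, t):
    """Return (survival, death density, hazard) spline estimates at age t."""
    i = max(0, min(bisect_right(x, t) - 1, len(x) - 2))
    h = x[i + 1] - x[i]
    a, b = x[i + 1] - t, t - x[i]
    sp = ((m[i] * a * a * b - m[i + 1] * b * b * a) / h ** 2
          + (p[i] * a * a * (2 * b + h) + p[i + 1] * b * b * (2 * a + h)) / h ** 3)
    dsp = (h * a * (2 * x[i] + x[i + 1] - 3 * t) * m[i]
           - h * b * (2 * x[i + 1] + x[i] - 3 * t) * m[i + 1]
           + 6 * a * b * (p[i + 1] - p[i])) / h ** 3
    return sp, -dsp, -dsp / sp
```

A quick sanity check is that a complete cubic spline reproduces any cubic polynomial exactly when the true end slopes are supplied.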


Finally the spline parameters m_i obtained in
Section 3.2 are substituted into (18) to determine
the estimate s_f(x) of the death density function
f(x). Note that when x = x_i, equation (18) reduces to

$$s_f(x_i) = -m_i, \tag{19}$$

as pointed out at the beginning of this subsection.

For ages beyond x_18 = 85, the death density
is estimated by differentiating (15) with respect to
t to get, for t = x - x_18 ≥ 0,

$$f(x) = -p_{18}\,(\ln g)(\ln c)\, c^t\, g^{c^t - 1}, \tag{20}$$

where p_18, c and g are given by (10), (16) and (17),
respectively.

For ages under one year the estimation method is
given in Hsieh (1985).

3.4 Estimation of Hazard Function
The hazard function h(x) is the conditional
probability per unit time of dying in the instant
immediately following age x given survival to age x.
Once the death density function and survival function
are determined as described in Sections 3.2 and 3.3
above, the hazard function is taken as their ratio.
For x in [1, 85), we divide (18) by (11) to get a spline
estimate of the hazard function:

$$s_h(x) = s_f(x)/s_p(x). \qquad (21)$$

Note that when $x = x_i$, equation (21) reduces to

$$s_h(x_i) = -m_i/p_i, \qquad (22)$$

in view of (19) and the comments at the end of the
paragraph following equation (14).

For ages beyond $x_{18} = 85$, we divide (20) by
(15) to get, for $t = x - x_{18}$,

$$h(x) = -(\ln g)(\ln c)\,c^{t}, \qquad (23)$$

where c and g are given by (16) and (17), respectively.

For ages under one year the estimation method is
given in Hsieh (1985).

Since differentiation of a spline results in a
spline (of one lower order) which still possesses
optimum approximation properties, the use of the spline
method of differentiation, unlike other numerical
differentiation procedures, greatly enhances the
accuracy of the estimates of both death density and
hazard functions. Before the advent of spline
functions, the difficulty with numerical differentiation had
been the main reason why conventional life tables do
not include death density and force of mortality.


3.5 Estimation of Life Expectancy

The  life  expectancy  at  age  x,  e(x),  is  the 
average  remaining  lifetime  for  a  person  alive  at  age 
x.  While  estimation  of  density  and  hazard  functions 
requires  differentiating  the  survival  function, 
estimation  of  life  expectancy  requires  integrating 
the  survival  function.  (Note  that  both  integration 
and  differentiation  of  splines  result  in  splines.) 
For the first year of life, we use the mean value
theorem of integral calculus to obtain an estimate of
the person-year integral

$$L_0 = \int_0^1 p(x)\,dx = 1 - (1-f)\,q_0, \qquad (24)$$

where f is the separation factor defined in (5). For
ages beyond $x_{18} = 85$, we integrate (15) from x ($\ge x_{18}$) to
$\infty$ to get the person-years lived beyond age x:

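$$T(x) = \int_x^\infty p(t)\,dt
= p_{18}\, g^{-1} E_1(-c^{\,x - x_{18}} \ln g)/\ln c, \qquad (25)$$

where the exponential integral

$$E_1(k) = \int_k^\infty (e^{-u}/u)\,du
= -\gamma - \ln k - \sum_{n=1}^{\infty} \frac{(-k)^n}{n \cdot n!}$$

(with $\gamma$ = 0.5772156649... being Euler's constant).
When $x = x_{18}$, (25) becomes equation (26).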

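For x in $[x_i, x_{i+1})$, $i = 1, 2, \ldots, 17$, we have person-years
lived beyond age x:

$$T(x) = \int_x^\infty p(t)\,dt
= \int_x^{x_{i+1}} s_p(t)\,dt + \sum_{j=i+1}^{17} \int_{x_j}^{x_{j+1}} s_p(t)\,dt + \int_{x_{18}}^\infty p(t)\,dt
= I(x) + \sum_{j=i+1}^{17} L_j + T(x_{18}), \qquad (27)$$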

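where

$$L_i = \int_{x_i}^{x_{i+1}} s_p(t)\,dt
= h_i (p_i + p_{i+1})/2 + h_i^2 (m_i - m_{i+1})/12 \qquad (28)$$

is obtained by integrating (11) from $x_i$ to $x_{i+1}$, and

$$I(x) = \int_x^{x_{i+1}} s_p(t)\,dt
= (x_{i+1}-x)\,(p(x) + p_{i+1})/2 - (x_{i+1}-x)^3\,(p''(x) + p''(x_{i+1}))/24, \qquad (29)$$

with

$$p''(x) = p''(x_i)(x_{i+1}-x)/h_i - p''(x_{i+1})(x_i - x)/h_i \qquad (30)$$

and

$$p''(x_i) = -2h_i^{-2}\,[(2m_i + m_{i+1})h_i + 3(p_i - p_{i+1})], \qquad (31)$$

$$p''(x_{i+1}) = 2h_i^{-2}\,[(m_i + 2m_{i+1})h_i + 3(p_i - p_{i+1})]. \qquad (32)$$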


Note that for $i = 1, 2, \ldots, 17$, $I(x_i) = L_i$ and $I(x_{i+1}) = 0$, so that
at the division points,

$$T(x_i) = \sum_{j=i}^{17} L_j + T(x_{18}) \qquad (33)$$

and that

$$T(0) = \sum_{i=0}^{17} L_i + T(x_{18}). \qquad (34)$$

With the tail person-year integral T(x) computed as
above, the life expectancy (mean residual life
function) is estimated by

$$e(x) = T(x)/p(x), \qquad (35)$$

where for x in [1, 85), $p(x) = s_p(x)$ and T(x) are given by (11) and
(27), respectively, and for x > 85, p(x) and T(x) are
given by (15) and (25), respectively. Note that at the
division points $x = x_i$, (35) reduces to

$$e(x_i) = \Bigl(\sum_{j=i}^{17} L_j + T(x_{18})\Bigr)/p_i. \qquad (36)$$
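
The computation at the division points can be sketched as follows: each interval contributes the person-years of equation (28), the contributions are accumulated from the oldest interval downward as in (33), and the result is divided by the survival value as in (36). The function names, and the treatment of the tail T(x_18) as a number supplied by the caller, are assumptions made for illustration.

```python
def interval_person_years(h, p_left, p_right, m_left, m_right):
    """Equation (28): exact integral of the cubic spline over one interval
    of width h, from end values p and end slopes m."""
    return h * (p_left + p_right) / 2 + h**2 * (m_left - m_right) / 12


def life_expectancy_at_knots(x, p, m, tail):
    """Equations (33) and (36): e(x_i) = (sum_{j>=i} L_j + tail) / p_i.

    x, p, m hold the division points, survival values and slopes;
    tail is the person-years lived beyond the last division point.
    """
    n = len(x)
    T = [0.0] * n
    T[-1] = tail
    # Accumulate person-years from the last interval back to the first.
    for i in range(n - 2, -1, -1):
        L_i = interval_person_years(x[i + 1] - x[i], p[i], p[i + 1], m[i], m[i + 1])
        T[i] = T[i + 1] + L_i
    return [T_i / p_i for T_i, p_i in zip(T, p)]
```

Because (28) is exact for a cubic, a quick check is to feed the routine a survival curve that is itself a polynomial of degree three or less and compare with the analytic integral.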

$$T(x_{18}) = \int_{x_{18}}^\infty p(t)\,dt = p_{18}\, g^{-1} E_1(-\ln g)/\ln c. \qquad (26)$$

3.6 Estimation of Conditional Mortality Probability
and Conditional Mean Residence Time (General)

For $0 \le x \le y < z \le \infty$, the conditional probability
of death in (y, z] given survival to age x, denoted by
q(x; y, z], is the ratio of the difference between the
survival function at y and the survival function at
z to the survival function at x. If x, y, z are in [1, 85),
then this conditional mortality probability is
estimated by

$$q(x; y, z] = [s_p(y) - s_p(z)]/s_p(x), \qquad (37)$$

where the spline survival functions $s_p(\cdot)$ are
obtained from (11). If either z alone, or y and z, or
x, y and z are greater than 85, then the corresponding
spline survival functions $s_p(\cdot)$ in (37) are to be
replaced by the Gompertz survival functions $p(\cdot)$
given by (15). When z = x+1 and y = x, (37) reduces
to the complete life table function

$$q_x = 1 - p(x+1)/p(x). \qquad (38)$$

Furthermore, when $z = x_{i+1}$ and $x = y = x_i$, (37) reduces to
the abridged life table function of Section 3.1:

$$q_i = 1 - p_{i+1}/p_i. \qquad (39)$$

[Table 1. Columns, repeated in three side-by-side panels: beginning age
of interval x; death density function f(x); hazard function h(x);
conditional mortality probability q_x; survival function p(x); life
expectancy e(x). The entries for f(x), h(x), q_x and p(x) are scaled by
powers of ten. The tabulated values are not recoverable. Taken from
Hsieh (1985).]

For $0 \le x \le y < z \le \infty$, the conditional mean
lifetime, e(x; y, z], represents the average number of
years lived in (y, z] for a person alive at age x. It
is the conditional expected life in (y, z] given
survival to age x and can be expressed as the
difference between the person-years lived beyond y
and the person-years lived beyond z divided by the
survival function at age x. In symbols,

$$e(x; y, z] = [T(y) - T(z)]/p(x). \qquad (40)$$

To estimate e(x; y, z], we substitute in (40) T(x) and p(x)
given by (27) and (11), respectively, if x is in [1, 85), and by (25)
and (15), respectively, if x > 85. If y = x, then (40) reduces to
the well-known Markov mean residence time. When y = x
and $z = \infty$, (40) reduces to the usual life expectancy
e(x) of Section 3.5.

The general conditional mortality probabilities
and mean lifetimes described in this subsection
provide additional instruments for mortality
analysis.
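
Both conditional quantities are simple ratios once a survival function p and a tail person-year function T are available, as the following sketch shows. For a self-contained check it substitutes an exponential survival curve with a hypothetical constant hazard `rate`; this stand-in is an assumption for illustration only and is not the spline or Gompertz model of this paper.

```python
import math


def conditional_death_probability(p, x, y, z):
    """Equation (37): q(x; y, z] = [p(y) - p(z)] / p(x)."""
    return (p(y) - p(z)) / p(x)


def conditional_mean_lifetime(T, p, x, y, z):
    """Equation (40): e(x; y, z] = [T(y) - T(z)] / p(x)."""
    return (T(y) - T(z)) / p(x)


rate = 0.1  # hypothetical constant hazard, for illustration only


def p_exp(t):
    """Exponential survival function exp(-rate * t)."""
    return math.exp(-rate * t)


def T_exp(t):
    """Person-years lived beyond t: integral of p_exp from t to infinity."""
    return math.exp(-rate * t) / rate


one_year_q = conditional_death_probability(p_exp, 40.0, 40.0, 41.0)
full_e = conditional_mean_lifetime(T_exp, p_exp, 40.0, 40.0, math.inf)
```

With y = x and z = x + 1 the first function gives the one-year probability of (38); with y = x and z infinite the second gives the ordinary life expectancy, which for the exponential stand-in is 1/rate at every age.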

4.  AN  EXAMPLE 

We have applied the methods of estimation
described in Sections 3.1 through 3.6 to the data
specified in Section 2 for Canadian males. These data
are available from the Annual Vital Statistics of
Canada (1979-1982). We use a base period of three
years from 1980 to 1982, inclusive, to calculate the
age-specific death rates $M_i$ and person-years of
exposure $Y_i$ from equations (2) and (3), which, in
turn, are used to compute $q_i$ from equations (6)-(9),
$i = 1, 2, \ldots, 17$. The value for $q_0$ has been computed from
equations (4) and (5). The spline methods and Gompertz
model of Sections 3.2 to 3.6 were then employed to
estimate f(x), h(x), $q_x$, p(x) and e(x) for age x from the q
values so computed. Table 1 shows the final results
for these five functions at x = 0, 1, 2, ..., 101, 102.
Computation of these functions at other age points
can be done similarly.


Extracting Records from New Jersey's Multiple Cause of Death Files
Giles Crane, New Jersey Department of Health


Abstract

A simple microcomputer system has been developed
using off-the-shelf components which permits
local access in an acceptable time frame to
seven years of New Jersey multiple cause of
death data assembled and distributed by the
National Center for Health Statistics. The
system includes hardware and software, and
illustrates a trade-off between speed and
specificity of access to approximately 70,000
records per calendar year. Applications to the
epidemiology of drowning and sickle cell anaemia
will be discussed with timing information and
rules of thumb for similar investigations. The
numbers of causes per person in New Jersey will
be summarized in several tables. If time
permits, the further analysis of abstracts from
these data will be illustrated by three short
examples: conventional statistical analysis, a
computationally intensive method, and an
application of an artificial intelligence technique.

1.  Introduction 

New  Jersey  is  the  fifth  smallest  state  in  the 
Union,  with  a  population  of  approximately  8 
million  people,  approximately  100,000  births 
per  year,  and  about  70,000  deaths  per  year.  As 
part  of  national  and  international  public 
health  efforts,  causes  of  death  are  coded  in 
the  ICD9  (International  Classification  of 
Disease  Codes)  (0)  and  entered  on  the  death 
certificate. NJ at present prepares computer-readable,
restricted-access tapes of death
certificates which include the "underlying
cause" of death, but as yet NJ does not
prepare multiple cause of death tapes.

In the view of some epidemiologists, the
multiple cause of death tapes are the single
most important source of epidemiological
information for health research. It was viewed
as imperative to improve access to these
tapes in response to requests by medical
researchers. Some requests
required over 6 months or, in several
instances, were never completed due to the
press of operational processing and maintenance
requirements at the Department minicomputer
unit.

There  are  many  other  NJ  health  outcome 
databases  which  offer  information  for  research 
and  yet  which  are  in  need  of  improved  access: 

1. Birth Tapes
2. Death Tapes
3. Hospital Discharges
4. Drug Treatment Discharges
5. Cancer Registry
6. Birth Defects Registry
7. Medical Claims Tapes
8. AIDS Registry
9. Fetal Death Tapes
10. Poisoning Reports
11. Cervical Cancer Screening Reports
12. Communicable Diseases
13. Family Planning
14. Hemophilia Financial Assistance
15. Mental Retardation Services
16. Fatal Accidents Reports
17. Homicide Reports
18. Community Mental Health Centers

2.  Multiple  Cause  of  Death  Records  (1,2) 

The distributing part of the path of the
multiple cause of death information for this
project was composed of the following steps
(from the top down; there is also a bottom-up
set of steps):

Federal Vital Statistics: coding
PRIME minicomputer or Ed. Comp. Network: reblocking
NJ Center for Health Statistics: minitape
Microcomputer: stripping, packing, compacting, and access

File sizes for the 7 years of Multiple Cause of
Death records are:

YEAR    RECORDS          MBYTES
1979    (not legible)    30.79
1980    71,202           31.33
1981    69,557           30.61
1982    69,854           30.74
1983    71,627           31.53
1984    71,743           31.57
1985    73,520           32.34

Under the present system, the 1986 records may
become available sometime in September 1988.
There appears to be scope for improvement by new
technology, or organization, or both.

The MCD records were stripped of unnecessary or
redundant fields, including "record axis"
fields, and only 10 of a possible 20 "entity
axis" codes (original ICD9 codes) were preserved.
These short MCD records were packed, two
characters (0123456789 Z) per byte, and then
subjected to a compacting utility. The record
sizes are shown below:

440 bytes  M.C.D. record (442 with CR,LF)
129 bytes  Short M.C.D. record (no CR,LF)
 65 bytes  Packed, short M.C.D. record (no CR,LF)
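
The packing step works because the twelve permitted characters each fit in four bits, so two characters share one byte and a 129-character short record occupies 65 bytes. The sketch below illustrates the idea in Python; GETMCD itself was written in C, and the particular assignment of nibble values to characters used here is an assumption.

```python
ALPHABET = "0123456789 Z"  # 12 symbols, so each fits in a 4-bit nibble


def pack(record):
    """Pack a string over ALPHABET into bytes, two characters per byte."""
    codes = [ALPHABET.index(ch) for ch in record]
    if len(codes) % 2:
        codes.append(0xF)  # pad nibble; 0xF is not a valid symbol code
    return bytes((hi << 4) | lo for hi, lo in zip(codes[0::2], codes[1::2]))


def unpack(packed, length):
    """Recover the original string of the given length from packed bytes."""
    nibbles = []
    for byte in packed:
        nibbles.extend((byte >> 4, byte & 0x0F))
    return "".join(ALPHABET[n] for n in nibbles[:length])


# Hypothetical 129-character short record built from permitted characters.
short_record = ("12 2826 0Z" * 13)[:129]
packed_record = pack(short_record)
```

The record length must be known when unpacking, because an odd-length record carries one pad nibble in its final byte.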

A C program, called GETMCD, was written to strip the
records, pack them, and also to unpack and access
them by ICD9 codes. Principal access is by
Multiple Cause of Death codes (ICD9CM):

Underlying cause code: 4 characters, e.g. 0460, 046.
Contributing cause entity axis codes: 7 characters.

char 1    Death Cert. Line no. 1..6
     2    Sequence No. on line 1..7
     3-6  Cause code: 4 char ICD9CM
     7    1 if nature of cause, 0 otherwise

Causes are specified somewhat like MS-DOS file
names, in which "?" denotes any single character
and is translated into a blank. For
example, sickle-cell disease is called for by
"??28260", i.e. any death certificate line, any
code on line, ICD9 code 282.6, and not a nature
of cause code. Lung disease is accessed by
several codes: "??500?0", "??501?0", "??502?0",
"??503?0", "??504?0", "??505?0".
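
The wildcard selection can be sketched as below. This is a Python illustration of the matching rule only, not the GETMCD source; it treats "?" as matching any character directly rather than translating it into a blank, and the sample records are hypothetical.

```python
def matches(pattern, entity_axis_code):
    """True if a 7-character entity axis code fits the pattern,
    where '?' in the pattern matches any single character."""
    if len(pattern) != len(entity_axis_code):
        return False
    return all(p == "?" or p == c for p, c in zip(pattern, entity_axis_code))


def select(records, patterns):
    """Keep the records having at least one code matching any pattern."""
    return [
        rec for rec in records
        if any(matches(pat, code) for code in rec for pat in patterns)
    ]


# Hypothetical records, each a list of 7-character entity axis codes.
records = [["1128260", "2142800"], ["1150000"], ["2350510"]]
sickle_cell = select(records, ["??28260"])
lung = select(records, ["??500?0", "??501?0", "??502?0",
                        "??503?0", "??504?0", "??505?0"])
```

A request for a disease group is then simply a list of patterns, one for each ICD9 code of interest.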

3.  Hardware  (3) 

The hardware used in this realization of the
access system consisted of a Compaq Deskpro dual
speed microcomputer (640K memory, 30 Mbyte hard
disk, 360K floppy disk, Irwin 10 Mbyte DC1000
minicartridge tape drive).

4.  Software  (3,4) 

The software consisted of Compaq MS-DOS Version
3.00 as an operating system, the TAPE.EXE
Version 1.05 tape utility, the Turbo C compiler, a
public domain packing program, a C program for
selection at high speed, a simple editor, and
several MS-DOS command files.

Detailed processing of the data abstracted from
the files is done by a range of techniques, from
printing the small file and inspecting the data
to more elaborate database and statistical
packages favored by individual investigators
(LOTUS, DBASE, SPSS-X, GLIM3, PRODAS).

5. Problems encountered

Typical tape and file handling problems were
encountered, such as a tape with several bad
blocks, a hardware problem with the reel magnetic
tape unit, inadvertent changes in directory names
when processing different years, and
inconsistencies in using CR,LF as record
terminators.

6.  Tests  of  the  system 

The C program GETMCD was tested in several ways.
Results of various test requests were checked
against the NCHS Underlying Cause counts and
against control tables in the MCD tape
documentation (1).

The C debugging process also provided further
assurance that the program was functioning as
desired.

The packed files were inspected by an
independent software package which displayed the
packed records in HEX format, a visual unpacking
of the 2 "nibbles" which make up an 8-bit byte.
The entire process was run forward and backward,
selecting all records.

Finally, a detailed investigation of the full
records from 1984 provided additional checks of
the packing and selecting process. (See next
section.)

7.  1984  Multiple  Cause  of  Death  Records 

The number of causes of death for 1984 in New
Jersey was analysed briefly in order to count
the records with over 10 contributing causes of
death, to provide further checks of the system,
and to provide general guidance for other
researchers. Quite probably, this is the first
time these figures have been published.


Number of              Number of
Contributing Causes    People       Percent
 1                      9,487       13
 2                     17,519       24
 3                     19,262       27
 4                     13,440       19
 5                      6,883       10
 6                      3,028        4
 7                      1,250        2
 8                        511        1
 9                        205        0.3
10                         92        0.1
11-14                      66        0.09

The average number of contributing causes for
1984 was 3.14. (It may be noted that the
Certificate of Death has three lines for
contributing cause of death.) The average
number of causes was calculated by age, race,
sex, marital status, and presence of autopsy:


Age              Race                        Sex
0       ?.97     Other Asian   2.89
1-4     3.30     White         3.14          male     3.11
5-9     3.44     Black         3.15          female   3.16
10-14   3.30     Am. Indian    3.12
15-24   3.49     Chinese       2.93
25-34   3.22     Japanese      2.86
35-44   3.00     Hawaiian      2.5 (2 only)
45-54   2.91     Filipino      3.36
55-64   2.93     All other     2.91
65-74   3.10
75-84   3.25
85+     3.25

Marital Status           Autopsy
single        3.14       yes   3.35
married       3.08       no    3.13
widowed       3.21
divorced      3.03
not stated    2.91
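
The overall average of 3.14 can be cross-checked against the frequency table of contributing causes. In the sketch below the only assumption is how to treat the grouped 11-14 category; the mean is computed at both extremes, and the 66 records involved are too few to change the rounded result.

```python
# People by number of contributing causes, from the 1984 frequency table.
counts = {1: 9487, 2: 17519, 3: 19262, 4: 13440, 5: 6883,
          6: 3028, 7: 1250, 8: 511, 9: 205, 10: 92}
count_11_to_14 = 66


def mean_causes(value_for_11_to_14):
    """Mean causes per person, treating every record in the grouped
    11-14 category as having value_for_11_to_14 causes."""
    total_people = sum(counts.values()) + count_11_to_14
    total_causes = sum(k * n for k, n in counts.items())
    total_causes += value_for_11_to_14 * count_11_to_14
    return total_causes / total_people, total_people


low, people = mean_causes(11)   # every grouped record has 11 causes
high, _ = mean_causes(14)       # every grouped record has 14 causes
```

The total number of people recovered from the table also agrees with the 1984 record count given in the table of file sizes.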

8.  Applications 

In  the  5  months  of  operation,  this  system  has 
been  applied  by  members  of  the  Office  of 
Research,  the  Divisions  of  Maternal  and  Child 
Health,  the  Occupational  Health  Program, 
Narcotic  and  Drug  Abuse,  Cardiovascular  Disease 
Unit  and  the  AIDS  Division,  as  well  as  by  other 
members  of  the  Office  of  the  Commissioner  of 
Health. 

In an investigation of drowning and near
drownings (5), immersion injuries leading to
death in New Jersey were identified and selected
for 1981-85. These records were matched with
hospital discharge data and further analysis was
done in order to calculate incidence rates and
case fatality ratios by age, race, sex, and
county.

In an investigation of sickle cell disease (6),
records were selected using ICD9CM codes for
282.4 (Thalassemias), 282.6 (Sickle cell
anemia), and 282.7 (Other hemoglobinopathies),
and a short paper was published giving the
results and implications.

After  an  initial  study  of  hospital  costs  (7), 
several  AIDS  related  studies  are  underway  which 
involve  data  from  three  other  sources  in 
addition  to  selections  from  the  Multiple  Cause 
of  Death  Records. 

9.  Examples  of  further  analysis  methods. 

The applications discussed were developed using
conventional statistical analysis techniques (8)
including cross tabulations, regression,
histograms and bar charts, and confidence limits.
More computationally intensive methods such as
the bootstrap and jackknife (9) are being applied
using short C programs and detailed applications
of elaborate statistical packages (8).

At this time artificial intelligence techniques
(10) are the subject of experiment, and various
strategies for employment of the AI inference
engine are being considered. One possible
project involving the learning aspect of AI
would be to build a knowledge base of related
diagnoses.

10.  Further  work 

The system could be improved by purchase of a
faster sequential access device (faster
minitape, CD ROM, video cassette), or dedication
of the hard disk to storage of the files. The C
program could be further optimized for speed,
either by applying an optimizing C compiler (for
example, Microsoft) or by rewriting the C program.
The use of the file compaction program might be
eliminated, or record selection could be made
a part of the compaction utility.
Combining the compaction, selection, and tape
reading utility into one program would reduce
the passes through the tape from 3 to 1.
Alternate storage orders for the records might
also be considered, perhaps by ICD9-CM code over
all years. The file-by-file rather than
record-by-record capability of the current tape drive
is a constraint on the problem, which has been
considered many times for record-by-record tape
systems.

Barriers to improvement are increased cost of
hardware, other users competing for time on the
microcomputer, and manufacturers who do not wish to
release software to directly access minitape
drives.

Arranging the tape files in the best order by
learning the years most frequently requested did
not increase efficiency here since the minitape
cartridge rewound automatically after restoring
a single file. It was decided to place the
yearly files in sequential order, 4 to a tape.
Having the last few years on the last tape will
eliminate some tape changes. A 40 Mbyte tape
drive will probably be adequate to hold all MCD
data for the life of the system (see Further Work).

11. Summary

Before this access system was devised, requests
for abstracts from MCD files from researchers at
the NJ Dept. of Health could require months or
were not possible. After the MCD files became
locally available, more than 10 different
researchers were able to access these data,
usually within two days. Studies involving these
data have been presented at meetings of the
local working group on health data, a national
conference, and an international conference.
Pre-processing of the raw information
(transference to microcomputer, stripping of
redundant fields, packing, compacting,
transferring to minitape) can be viewed as moving
the common part of the time required for
satisfying any request to the time required to
prepare the system. This appears to be a


general principle which motivates the formation
of many systems.

Appendix: Estimate of time and cost to change
birth or death certificate.

At the presentation of this paper, one
conference member asked how long and how much it
would cost to correct a birth or death
certificate in New Jersey. Corrections to birth
certificates are made at no cost and the
correction form is inserted at the end of the
queue of certificates on hand. At this date,
February birth certificates are being entered
and so there will be a delay of h months. Also,
the usual charge will be made for a new
certificate. As for corrections to death
certificates, a similar policy holds, but a law
states that new death certificates must be
available within 60 days. Experience here
indicates that a correction which is passed down
from Federal Vital Statistics and which requires
verification at the office nearest the site of
death may require as long as 6 months.


A PROBABILISTIC APPROACH TO RANGE DATA SEGMENTATION


EZZET AL-HUJAZI, WAYNE STATE UNIVERSITY
ARUN SOOD, GEORGE MASON UNIVERSITY


ABSTRACT 

In  this  paper  we  present  a  region  growing 
approach  for  segmenting  range  images  based  on 
the  H  (Mean  Curvature)  and  K  (Gaussian 
Curvature)  parameters.  Range  images  are 
unique  in  that  they  directly  approximate  the 
physical  surfaces  of  a  real  world  3-D  scene. 
H  and  K  are  defined  from  the  fundamental 
theorems  of  differential  geometry,  and 
provide  visible,  invariant  pixel  labels  that 
can  be  used  to  characterize  the  scene.  The 
sign  of  H  and  K  can  be  used  to  classify  each 
pixel  into  one  of  eight  possible  surface 
types.  Due  to  the  sensitivity  of  these 
curvature  parameters  to  noise,  the  computed 
HK-sign  map  does  not  directly  identify 
surfaces  in  the  range  image.  In  this  paper  a 
probabilistic  approach  for  the  segmentation 
of  the  range  image  is  suggested.  The  image  is 
modeled  as  a  Markov  Random  Field  on  a  finite 
lattice.  The  prior  knowledge  about  the 
solution  is  expressed  in  the  form  of  a  Gibbs 
probability  distribution.  This  approach 
allows  the  integration  of  the  output  of  a 
number  of  modules  in  an  efficient  way.  The 
performance  of  the  proposed  technique  on  a 
number  of  range  images  will  be  presented. 

1.  INTRODUCTION 

Statistical techniques for modeling
and processing image data have seen
increasing interest in the computer vision
literature recently. Most of the work has
been directed toward application of Markov
Random Field (MRF) models to problems in
texture modeling and classification and
problems in segmentation and restoration of
noisy and textured images [2,4,5,6,7,9].

In differential geometry the information
given by the signs of H and K can be used to
classify a surface point into one of eight
possible labels. These two surface curvatures
are derived from the first and second
fundamental forms. They are sensitive to
noise, and the resulting HK-sign map does not
correspond directly to surfaces in the image
and thus has to be further processed.

In this paper an algorithm based on MRF
and edge models is suggested for processing
the HK-sign map. This approach is chosen
because it allows an analytical basis for
integrating a number of object features. A
variable neighborhood area is used for the
MRF, which gives a good compromise between the
speed of processing and the number of pixels
misclassified by the algorithm.

The  paper  is  organized  as  follows.  Section 
2  presents  a  review  of  relevant  differential 
geometry  results,  and  Section  3  presents  a 
review  of  MRF  and  the  Gibbs  Distribution 
(GD).  Our  algorithm  will  be  given  in  Section 
4.  Section  5  shows  results  of  processing 
various  range  images  and  Section  6  outlines 
the  conclusions. 

2.  H  AND  K  CURVATURE  PARAMETERS 

H and K are identified as the local second
order surface characteristics that possess
several invariance properties and represent
extrinsic and intrinsic surface geometry
features respectively. The signs of these
surface curvatures can be used to classify
the image surface points into one of eight
basic types. Fig. 1 shows the corresponding
surface labels. These two curvature
parameters can be calculated using [1]:


K=

Fig. 1 Surface type labels from H and K.

Some of the problems with the HK-sign map
are: a) Preliminary smoothing is necessary to
obtain reasonable values for H and K [1].
However, after filtering, the HK-sign surface
labels reflect the geometry of the
smoothed surface data. Hence, the HK-sign map
must be further processed. b) In the presence
of noise, HK-sign map surface labels tend to
connect the labels of neighboring, but
distinct, surface regions. c) Global surface
properties are lacking.
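
The classification of a pixel by the signs of H and K can be sketched as follows. Two assumptions are made loudly here: the curvature expressions are the textbook formulas for a graph surface z = f(u, v) written in terms of its partial derivatives, and the eight label names follow the convention common in the range-image literature. Neither is taken from this paper, and the threshold `eps` is a hypothetical tolerance for treating a noisy value as zero.

```python
def curvatures(fu, fv, fuu, fuv, fvv):
    """Mean (H) and Gaussian (K) curvature of a graph surface z = f(u, v),
    from its first and second partial derivatives at a point."""
    g = 1.0 + fu * fu + fv * fv
    K = (fuu * fvv - fuv * fuv) / (g * g)
    H = ((1 + fv * fv) * fuu + (1 + fu * fu) * fvv
         - 2 * fu * fv * fuv) / (2 * g ** 1.5)
    return H, K


def sign(value, eps):
    """Three-level sign: 0 inside the tolerance band, else -1 or +1."""
    return 0 if abs(value) <= eps else (1 if value > 0 else -1)


# Assumed label convention, keyed by (sign of H, sign of K).
SURFACE_TYPES = {
    (-1, 1): "peak", (-1, 0): "ridge", (-1, -1): "saddle ridge",
    (0, 0): "flat", (0, -1): "minimal",
    (1, 1): "pit", (1, 0): "valley", (1, -1): "saddle valley",
}


def surface_type(H, K, eps=1e-6):
    """Map a curvature pair to one of the eight surface labels."""
    return SURFACE_TYPES.get((sign(H, eps), sign(K, eps)), "undefined")
```

The combination H = 0 with K > 0 cannot occur on a real surface, but thresholding noisy estimates can produce it, which is why the lookup falls back to "undefined".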

3.  MRF  AND  THE  GD 

The concept of a MRF is a direct extension
of the concept of a Markov process to higher
dimensions [8]. A discrete MRF on a finite
lattice is defined as a collection of random
variables, which correspond to the sites of
the lattice.

Definition of MRF:

Consider an N x N rectangular lattice, and let
r = (i, j) be an index of pixel locations,
where i, j specify pixel row and column
location and satisfy $1 \le i, j \le N$. Let $\{x_r\}$
denote a random field, with $x_r$ the field at
pixel r, X a vector specifying the field over
the entire N x N lattice and having components
$x_r$, and $X_{(r)}$ the field everywhere but at
pixel r. Then $\{x_r\}$ is a MRF if

$$P(x_r \mid X_{(r)}) = P(x_r \mid x_v,\ v \in D_p)$$

for all r, and P(X = x) > 0 for all x. $D_p$ denotes
a neighbor set,

$$D_p = \{ v = (l, m) \text{ such that } \|r - v\|^2 \le N_p \text{ and } v \ne r \},$$

where P is the order of the process, and $N_p$
is an increasing function of P. $N_p$ takes the
values 1, 2, 4, 5, 8 for P = 1, 2, 3, 4, 5,
respectively ($N_p$ is the square of the
Euclidean distance to the farthest neighbor).
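
The neighbor set can be enumerated directly from its definition. The short sketch below lists, for a given order P, the offsets from a pixel to its neighbors; the names are illustrative, and lattice boundaries are ignored.

```python
N_P = {1: 1, 2: 2, 3: 4, 4: 5, 5: 8}  # squared radius N_p for each order P


def neighbor_offsets(order):
    """Offsets (di, dj) from a pixel to every neighbor in the order-P
    neighbor set, that is, those with 0 < di**2 + dj**2 <= N_p."""
    limit = N_P[order]
    reach = int(limit ** 0.5)  # largest offset that can lie within the radius
    return [
        (di, dj)
        for di in range(-reach, reach + 1)
        for dj in range(-reach, reach + 1)
        if 0 < di * di + dj * dj <= limit
    ]


sizes = [len(neighbor_offsets(order)) for order in range(1, 6)]
```

The first-order set consists of the four nearest pixels and the second-order set adds the four diagonal ones; by the fifth order the set fills the whole 5 x 5 window around the pixel.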

Definition  of  cliques: 

Given a neighborhood system on a lattice, a
clique c is such that:

a) c consists of a single pixel, or

b) for $r \ne v$, $r \in c$ and $v \in c$ implies that
r and v are neighbors. The collection of all
cliques is denoted by C. The neighborhood
system up to the fifth order and the cliques
associated with the first order system are
shown in Fig. 2.

    |5|4|3|4|5|
    |4|2|1|2|4|
    |3|1|x|1|3|
    |4|2|1|2|4|
    |5|4|3|4|5|

[clique diagrams not recoverable]

Fig. 2 The neighborhood systems
and the cliques for the first order.

Definition  of  GD: 

A random field $X = \{x_r\}$ defined on a lattice
has an associated GD (or equivalently is a
Gibbs Random Field (GRF)) with respect to $D_p$
iff its joint distribution is of the form:

$$P(X = x) = (1/Z)\exp(-U(x))$$

where $U(x) = \sum_{c \in C} V_c(x)$ is the energy function,
$V_c$ is the potential associated with clique c,
and $Z = \sum_x \exp(-U(x))$ is a normalization factor.

Hammersley-Clifford theorem: Let $D_p$ be a
neighborhood system on a finite lattice. A
random field X is a MRF with respect to $D_p$
iff its joint distribution is a GD with
cliques associated with $D_p$.

4.  OUR  ALGORITHM 

Biological vision systems achieve efficient, robust and reliable
recognition in highly variable environments through the integration
of many visual sources. For example, the simple task of locating
object boundaries can be performed far more effectively by
integrating evidence of discontinuities in image intensity, stereo
disparity, speed and direction of motion, and texture information
than by using evidence from a single visual source on its own. The
integration problem is computationally complex. The integration can
be achieved by associating a MRF on a lattice with each physical
process and another (binary) model with its discontinuities. The
lattices are coupled to each other to reflect the interdependence of
the corresponding processes in image formation. Similar work using
this approach can be found in [6,9], among others. In general, these
methods are computationally expensive and the number of quantization
levels must be small (typically 2 or 3). The use of H and K allows us
to reduce the number of levels from 256 (the original image) to 3
levels (-, 0, + for H and K).

The  flow  chart  of  our  algorithm  is  shown 
in  Fig. 3.  The  H  and  K  are  calculated  in 
multi-scale  fashion.  Then  the  output  of  the 
multi-scale  is  combined  with  the  edge 
information  and  the  surface  normals.  This 
will  give  us  a  seed  region  and  edge 
information  which  will  be  entered  to  the 
region  growing  algorithm.  H  and  K  are 
processed  separately  and  then  combined  to 
obtain  the  HK-sign  map.  Final  surface 
description  of  the  object  can  be  obtained  by 
fitting  surfaces  to  the  HK-sign  map. 


4.1  Finding  the  Seed  Region: 

The seed regions are obtained using a multi-scale approach. This
approach is justified because the output from different scales
changes significantly on the boundary of the object, while the points
well inside the surface do not change. The input image is smoothed
with Gaussian filters of different standard deviations, and for each
output the values of H and K are estimated. The signs of the
resulting H and K values are then used to form a three-level image
for H and for K. The outputs from the different scales (3 in our
experiments) are then combined by identifying the points where the
signs of H and K are identical at every scale. The labeling of these
points is assumed to be correct and is used as the seed region for
the region growing algorithm. The edges are also obtained and
superimposed on the multi-scale output.
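The combination of the scales can be sketched as follows. This is a minimal illustration that assumes the three-level sign images have already been computed for each scale and are stored as nested lists with entries -1, 0 and +1; the name `seed_labels` and the use of `None` for unclassified pixels are assumptions.

```python
def seed_labels(sign_maps):
    """Keep a pixel's sign only where all scales agree; otherwise mark it None.

    sign_maps: list of equally sized 2-D lists (one per scale) holding -1, 0 or +1.
    """
    rows, cols = len(sign_maps[0]), len(sign_maps[0][0])
    seeds = [[None] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            values = {scale[i][j] for scale in sign_maps}
            if len(values) == 1:  # identical sign at every scale
                seeds[i][j] = values.pop()
    return seeds
```

The same routine would be applied once to the H sign images and once to the K sign images; pixels left as `None` are the ones handed to the region growing step.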

In some cases, for example for a roof edge, surface normal
information is needed to segment the planar regions. The surface
normal is estimated by fitting a plane within a 3 x 3 mask. The
surface normal information is also used to segment the background of
the object.
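A least-squares plane fit over a 3 x 3 window has a closed form, because the pixel offsets -1, 0, +1 are symmetric about the center. The sketch below assumes unit pixel spacing and a range value z at each pixel; the function name `surface_normal_3x3` is illustrative, and the paper does not state which fitting procedure was actually used.

```python
import math


def surface_normal_3x3(window):
    """Fit z = a*x + b*y + c to a 3 x 3 window and return the unit normal.

    window[row][col] holds range values; x grows with col and y with row.
    For offsets -1, 0, +1 the normal equations decouple, so
    a = sum(x*z) / sum(x*x) and b = sum(y*z) / sum(y*y), both sums of squares being 6.
    """
    offsets = (-1, 0, 1)
    a = sum(x * window[r][c] for r in range(3) for c, x in enumerate(offsets)) / 6.0
    b = sum(y * window[r][c] for r, y in enumerate(offsets) for c in range(3)) / 6.0
    norm = math.sqrt(a * a + b * b + 1.0)
    return (-a / norm, -b / norm, 1.0 / norm)
```

Neighboring pixels whose normals point in nearly the same direction can then be grouped into one planar region.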

4.2 Region Growing:

The seed region output is passed to the region growing step. The
region points are modeled as a MRF with variable neighborhood size.
For the edge points, an additional term is used. The energy function
integrates these two models:

$$U(f, e; g) = \sum_{p} V(e) + (1 - e) \sum_{D_p} V(f_r, f_v \mid e)$$

where g is the output from the seed region step, e is the edge point
(binary), f is the processed image, $D_p$ is the neighborhood system,
and V(e) is the energy function due to the presence of an edge. V(e)
is computed using Fig. 4, in which a value is assigned to V(e) for
each possible local configuration of the edge point. This model
encourages the formation of continuous edges and discourages thick
edges. For example, if points B and C are edge points, the model
discourages the presence of an edge at A.

$V(f_r, f_v \mid e)$ is the energy function due to the pixel labels
in the neighborhood area given the edge points. Only the single pixel
clique is used in the experimental results.

[Table: value of V(e) for each configuration of the points A, B, C
and D; 0 = no edge, 1 = edge]

Fig. 4. The Edge Model.

The region growing algorithm proceeds by collecting the edge points
and the pixels left unclassified by the multi-scale approach in an
array. A point is then picked at random. The energy function given
earlier is then minimized in the local area surrounding the selected
pixel. This is repeated for all the points in the array for a number
of iterations (a maximum of 30 was used in the experimental results).
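The iteration can be sketched as a randomized local minimization. The code below is a simplified stand-in and not the paper's energy function: it assumes a first-order neighborhood, uses the number of disagreeing neighbors as the local energy, and ignores neighbors that are edge points so that labels do not spread across edges. The name `grow_regions` and all parameters are hypothetical.

```python
import random


def grow_regions(labels, edges, max_iterations=30, seed=0):
    """Relabel edge and unclassified pixels by minimizing a local energy.

    labels: 2-D list with -1, 0, +1 or None (unclassified); modified in place.
    edges:  2-D list with 1 at edge points and 0 elsewhere.
    """
    rows, cols = len(labels), len(labels[0])
    rng = random.Random(seed)
    pending = [(i, j) for i in range(rows) for j in range(cols)
               if labels[i][j] is None or edges[i][j]]
    for _ in range(max_iterations):
        rng.shuffle(pending)  # visit the points in random order
        changed = False
        for i, j in pending:
            around = [labels[i + di][j + dj]
                      for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1))
                      if 0 <= i + di < rows and 0 <= j + dj < cols
                      and not edges[i + di][j + dj]
                      and labels[i + di][j + dj] is not None]
            if not around:
                continue  # nothing to compare with yet
            # Assumed local energy: count of neighbors carrying a different label.
            best = min((-1, 0, 1), key=lambda lab: sum(lab != n for n in around))
            if labels[i][j] != best:
                labels[i][j] = best
                changed = True
        if not changed:
            break  # no label changed during a full pass
    return labels
```

Stopping early when a full pass changes nothing is a design choice of this sketch; the text only states that a maximum of 30 iterations was used.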

5.  EXPERIMENTAL  RESULTS 

The algorithm has a good parallel computational structure, since the
multi-scale processing, the edge detection and the surface normal
estimation can be computed simultaneously. The computations of H and
K are also independent and can be carried out simultaneously. The
algorithm has been tested on a number of synthetic and real images.
The images are 128 x 128 with 8 bits/pixel. The H and K values are
obtained following the procedure suggested by [1]. Experimental
results for different objects are shown in Fig. 5 through Fig. 7. To
assess the importance of the edge information, images are processed
with and without the edge model. We have used a variable neighborhood
system (up to the fifth order) for the region model.

The first object (Fig. 5a) is a synthetic image of a sphere. The
output of the seed region step is shown in Fig. 5b for H and in
Fig. 5e for K. The result of the region growing step is shown in
Fig. 5c for H and in Fig. 5d for K. Fig. 5f shows the final HK-sign
map obtained by combining the output from Fig. 5c and Fig. 5d. As can
be seen, the image is segmented perfectly using this method. Fig. 6a
is a range image of a coke bottle obtained using a laser range finder
at the Environmental Research Institute of Michigan (ERIM). The
contents of Fig. 6 are arranged as in Fig. 5. Good segmentation is
obtained with the exception of a small area at the tip of the coke
bottle.

Fig. 7 shows the results for a coffee cup obtained from ERIM. Fig. 7a
shows the image. Fig. 7d and Fig. 7h show the seed regions obtained
for H and K respectively. The range image is then processed in two
different ways. Fig. 7e and Fig. 7i show the output of the region
growing algorithm with the edge model. Fig. 7b shows the final
HK-sign map. The segmentation results obtained were good with the
exception of the handle of the coffee cup, which was not classified.
This is because of the size of this region and the restriction in the
algorithm on the number of pixels required for classification. In
Fig. 7f and Fig. 7g the outputs of the region growing algorithm
without the edge model are shown. Fig. 7c shows the final HK-sign
map. In this case the handle is classified as a planar region, and
small regions of the cylindrical surfaces of the object are also
classified as planar. A comparison of Fig. 7b and Fig. 7c illustrates
that inclusion of the edge model leads to fewer misclassified points.
To emphasize the advantage of using a variable neighborhood system
for the MRF, Fig. 8 shows the results of processing the coffee cup
with different fixed neighborhood systems. In this figure the time
required for processing is compared for five different neighborhood
systems. The time required for the variable neighborhood system (up
to the fifth order) is also shown.

Fig. 8  Comparison between fixed and variable neighborhood systems:
a) Time (min), b) Difference in classification (1000 pixels).

H and K allow us to work with a small number of levels (3 compared
with 256), which makes the processing faster. The use of a variable
neighborhood system MRF reduces the number of misclassified pixels
with a small increase in the time required for processing.

Future work will concentrate on a surface fitting step, which will be
used to obtain a final description of the range data. This will also
help in classifying the pixels left unclassified in the output of the
algorithm. In addition, the algorithm relies on the initial seed
region calculation using the multi-scale approach. Possible
variations to this procedure and to the parameter estimation will be
studied.

ABSTRACT

The  purpose  of  this  paper  is  to 
measure  the  amount  of  compression  that 
can  be  accomplished  by  the  use  of 
Arithmetic  Coding  and  the  coding 
processing  time.  An  IBM-PC  based  system 
has  been  developed  for  both  encoding  and 
decoding.  The  results  using  adaptive  and 
non-adaptive  techniques  are  presented.  The 
test  data  consisted  of  a  256  gray  level 
image  file  and  seven  classes  of  different 
data  files.  Performance  evaluation  is 
discussed  in  terms  of  encoding  time  and 
decoding  time. 

I.  Introduction 

Minimum redundancy coding of the data in a system is attractive for
two major reasons: storage saving and performance improvement.
Storage saving is a direct and obvious benefit, whereas the
performance improvement follows from the fact that less data is moved
during communication. Arithmetic coding and Huffman coding are
approximately minimum redundancy coding techniques in which code
words are of variable length.

Huffman  coding  is  one  of  the 
pioneering  works  in  the  construction  of 
minimum  redundancy  code.  It  was  developed 
in  1952  by  Huffman  [1].  Because  of  its 
simplicity,  it  has  been  developed  on  small 
systems  with  encouraging  results  [2].  To 
code  a  file  using  the  standard  Huffman 
method: 

1.  Determine  the  frequency  of  each 
character. 

2.  Construct the Huffman coding table by assigning a
variable-length code to each character. Generally, this results in
the assignment of short codes to the characters that occur most
frequently.

3.  Encode  the  input  file. 

4.  At  any  future  time,  the  file  can  be 
reconstructed  using  the  stored  Huffman 
coding  table. 
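The four steps above can be sketched compactly. This is an illustrative implementation of standard Huffman coding using bit strings for readability, not the program used in the paper; the names `huffman_table`, `huffman_encode` and `huffman_decode` are assumptions.

```python
import heapq
from collections import Counter


def huffman_table(data):
    """Steps 1 and 2: count character frequencies and build the code table."""
    freq = Counter(data)
    # Heap entries are (count, tie_breaker, partial_table); the unique tie breaker
    # keeps the comparison from ever reaching the dictionaries.
    heap = [(count, index, {symbol: ""})
            for index, (symbol, count) in enumerate(sorted(freq.items()))]
    heapq.heapify(heap)
    if len(heap) == 1:  # a single distinct character still needs one bit
        return {symbol: "0" for symbol in heap[0][2]}
    next_index = len(heap)
    while len(heap) > 1:
        count1, _, table1 = heapq.heappop(heap)  # two least frequent subtrees
        count2, _, table2 = heapq.heappop(heap)
        merged = {s: "0" + code for s, code in table1.items()}
        merged.update({s: "1" + code for s, code in table2.items()})
        heapq.heappush(heap, (count1 + count2, next_index, merged))
        next_index += 1
    return heap[0][2]


def huffman_encode(data, table):
    """Step 3: replace every character by its code."""
    return "".join(table[symbol] for symbol in data)


def huffman_decode(bits, table):
    """Step 4: rebuild the text from the bits and the stored table."""
    inverse = {code: symbol for symbol, code in table.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:  # codes are prefix-free, so the first match is right
            out.append(inverse[current])
            current = ""
    return "".join(out)
```

For the input "aaaabbc" the most frequent character receives a one-bit code and the other two receive two-bit codes, so the seven characters are coded in ten bits.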

Arithmetic coding [3] has been proposed as superior in most respects
to the Huffman scheme. Here, the input message is represented as an
interval of real numbers between 0 and 1. The longer the message, the
smaller the interval needed to represent it, and thus the more bits
are needed to describe the interval. Each symbol of the message
reduces the size of the interval by an amount determined by its
frequency of occurrence. A more likely symbol reduces the range by
less than an unlikely one, and consequently adds fewer bits to the
coded message. The end of the message is represented by a unique
message terminator symbol.

Arithmetic  coding  technique  was 
introduced  in  a  textbook  by  Abramson  [4]. 
As  a  compression  technique,  this  method  is 
not  widely  known.  However,  reference  [5] 
is  a  good  introduction  to  the  subject  of 
arithmetic  coding. 

II.  Algorithms  and  Implementation 

Both the encoder and the decoder know (or can generate) the
probabilities of occurrence of, and the portions of the range
occupied by, the various symbols, and the initial range is [0,1).
With this in mind, the decoder can deduce the encoded characters one
by one by analyzing which range the interval lies within as each
symbol is revealed. Also, both encoder and decoder know a
unique_eof_symbol that will be used to terminate messages.


The encoding and decoding algorithms can be summarized as follows:


ENCODE

while not eof
begin
    read symbol;
    if eof
        symbol = unique_eof_symbol;
    current_range = range_high - range_low;
    range_high = range_low + current_range *
                 frequency_sum[symbol - 1];
    range_low = range_low + current_range *
                frequency_sum[symbol];
end;


DECODE

while symbol <> unique_eof_symbol
begin
    read code_value;
    symbol = 1;
    while not (frequency_sum[symbol] <=
               (code_value - range_low) / (range_high - range_low)
               < frequency_sum[symbol - 1])
        symbol = symbol + 1;
    if symbol = unique_eof_symbol
        stop;
    current_range = range_high - range_low;
    range_high = range_low + current_range *
                 frequency_sum[symbol - 1];
    range_low = range_low + current_range *
                frequency_sum[symbol];
    write symbol;
end;

In the above pseudocode, the possible symbols are numbered
1, 2, ..., number_of_symbols + 1, with the last symbol being the
unique_eof_symbol. The frequency range of an individual symbol is:

[frequency_sum[symbol], frequency_sum[symbol - 1])
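The pseudocode can be exercised with exact rational arithmetic, which avoids the precision questions that the integer implementation described in the following paragraphs has to solve. This is a minimal sketch under stated assumptions: `frequency_sum[0]` is taken to be 1 and the array decreases to 0, the decoder receives a single code value (here the low end of the final interval), and the helper names are illustrative.

```python
from fractions import Fraction


def build_frequency_sum(probabilities):
    """Return frequency_sum with frequency_sum[0] = 1, decreasing to 0.

    Symbol k (numbered from 1) owns [frequency_sum[k], frequency_sum[k - 1]).
    """
    frequency_sum = [Fraction(1)]
    for p in probabilities:
        frequency_sum.append(frequency_sum[-1] - p)
    return frequency_sum


def arithmetic_encode(symbols, frequency_sum):
    """Narrow [low, high) once per symbol; the message must end with the EOF symbol."""
    low, high = Fraction(0), Fraction(1)
    for s in symbols:
        width = high - low
        high = low + width * frequency_sum[s - 1]  # uses the old value of low
        low = low + width * frequency_sum[s]
    return low, high


def arithmetic_decode(code_value, frequency_sum, eof_symbol):
    """Recover the symbols that precede the EOF symbol from one code value."""
    low, high = Fraction(0), Fraction(1)
    out = []
    while True:
        width = high - low
        target = (code_value - low) / width
        s = 1
        while not (frequency_sum[s] <= target < frequency_sum[s - 1]):
            s += 1
        high = low + width * frequency_sum[s - 1]
        low = low + width * frequency_sum[s]
        if s == eof_symbol:
            return out
        out.append(s)
```

With probabilities 1/2, 1/4 and 1/4 (the last for the terminator), the message 1, 2, 1 followed by the terminator narrows the interval to [11/16, 45/64), and any value in that interval decodes back to 1, 2, 1.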

In practice, we first generate a list containing probabilities for
each symbol that can be encoded. Since each symbol is a byte, the set
of symbols can be represented by integers from 0 to 255, with the
unique_eof_symbol being 256. We represent this scheme as a
257-position array. The frequencies can be represented in one of two
ways: a non-adaptive or fixed representation, in which the
frequencies are determined in advance for a given class of data files
to be compressed (e.g., image files), and an adaptive method, in
which the frequencies are generated based on the symbols observed
during compression or expansion of the file being processed. We will
see that the adaptive model in general provides more desirable
results than the fixed model. Probabilities are represented as
integers, and cumulative counts are stored in a second array. To
prevent overflow, the counts are scaled as necessary.
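An adaptive model of this kind can be sketched as a table of integer counts that is halved whenever the total exceeds a limit. The initial count of 1 per symbol, the limit of 16383 and the increasing orientation of the cumulative array are assumptions made for illustration; the paper does not give these details, and its `frequency_sum` array runs in the opposite direction.

```python
class AdaptiveModel:
    """Integer symbol counts with rescaling to keep the total bounded."""

    def __init__(self, number_of_symbols=257, max_total=16383):
        # Every symbol starts with count 1 so that none has zero probability.
        self.counts = [1] * number_of_symbols
        self.max_total = max_total

    def cumulative(self):
        """Return cum where cum[k] is the total count of symbols below k."""
        cum = [0]
        for count in self.counts:
            cum.append(cum[-1] + count)
        return cum

    def update(self, symbol):
        """Record one occurrence of symbol, halving all counts on overflow."""
        self.counts[symbol] += 1
        if sum(self.counts) > self.max_total:
            # Rounding up keeps every count at least 1.
            self.counts = [(count + 1) // 2 for count in self.counts]
```

Encoder and decoder stay synchronized as long as both call `update` with the same symbol after coding it, which is why no frequency table has to be stored with the compressed file.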

Since  all  values  are  represented  as 
integers  and  operations  are  performed  on 
only  a  byte  at  a  time,  all  data  must  be 
transmitted  and  received  incrementally.  To 
accomplish  this,  bits  in  the  low  and  high 
ends  of  the  range  are  transmitted  as  they 
become  the  same  and  the  range  is  rescaled. 

To begin the encoding process, a byte is read and used as an index
into the frequency array. In the non-adaptive or fixed case the
frequency is simply read from the array. In the adaptive case the
current frequency is used and the frequency is then updated to
include the new symbol. Next, the frequency is applied to the range
and the next byte is read. A code buffer is maintained to hold bits
to be transmitted. When the byte-long buffer is filled, the byte is
written and the buffer is cleared to accept new data. When
end_of_file is reached, the unique_eof_symbol indicator is encoded
and the encoding process is complete.

To decode, a byte of the encoded file is read and the first code
value is extracted; the frequency array is then scanned to find the
symbol corresponding to that code. In the adaptive version the
frequency list is then updated to include the new symbol. The symbol
is written and the remaining portion of the byte is processed to
extract the next code value. When the byte has been exhausted, the
next byte is read and the process continues until the
unique_eof_symbol is identified. At this point any extraneous bits
are written and the decoding process is complete.


FIGURE 1

FIGURE 3

III.  Results and Discussion

The results of the experiment are based on the application of both
fixed and adaptive Arithmetic Coding to eight classes of
100,000-byte files. The files were an executable binary file, a C
source code file, a Multimate document file, a 256 gray level image
file, an ASCII data file, a dBase III data base file, a text file and
a Lotus 123 spreadsheet file. The frequency distributions of four of
these files are presented in Figures 1 through 4.

The  program  implementation  is  in 
Microsoft  Quick  C  on  an  IBM  PC  XT  286 
running  DOS  3.2.  Times  are  in  seconds  per 
byte  of  uncompressed  data  and  include  all 
I/O  and  operating  system  overhead. 

Three different distributions were used for the non-adaptive or fixed
method. The first was a frequency distribution similar to that of the
English language, the second a distribution generated by averaging
the actual frequencies of the eight test files, and the third a
distribution generated from the image test file (Figure 1). In the
adaptive method the frequency distribution is dynamic, changing as
each symbol is observed. The results of encoding and decoding each of
the eight data files are given in Tables I through IV.
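The amount of compression follows directly from the reported output sizes, since every test file is 100,000 bytes long. The short sketch below uses the adaptive-model output sizes listed in Table IV; the function names are illustrative, and the percentages are derived values rather than figures quoted in the paper.

```python
ORIGINAL_BYTES = 100_000  # size of every test file, as stated in the text

# Output sizes in bytes for the adaptive model (Table IV).
ADAPTIVE_OUTPUT_BYTES = {
    ".EXE file": 87_382,
    "C source program": 61_510,
    "Multimate .DOC file": 42_371,
    "EMCS619 .IMG file": 79_522,
    "ASCII data file": 52_696,
    "dBase III .DBF file": 50_976,
    "Text file": 56_209,
    "Lotus 123 .WKS file": 50_909,
}


def percent_saved(output_bytes, original_bytes=ORIGINAL_BYTES):
    """Return the storage saving as a percentage of the original size."""
    return 100.0 * (original_bytes - output_bytes) / original_bytes


def summarize(outputs):
    """Map each file class to its saving, rounded to one decimal place."""
    return {name: round(percent_saved(size), 1) for name, size in outputs.items()}
```

Calling `summarize(ADAPTIVE_OUTPUT_BYTES)` gives one saving per file class; an output larger than 100,000 bytes, as occurs for some fixed distributions, yields a negative saving.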

Not surprisingly, the results in Tables I through IV reflect that the
English language distribution performed best for the files containing
English-like data and the image distribution performed best for the
image file. The average distribution performed surprisingly well.
However, the adaptive method consistently performed better than the
other distributions.

TABLE I

Fixed Model English Language Distribution

                        Output    Encode time (s)     Decode time (s)
File                    (bytes)   Total   per byte    Total   per byte
.EXE file               149,793   211.0   0.002110    324.0   0.003240
C source program         86,052   135.0   0.001350    202.0   0.002020
Multimate .DOC file     113,341   193.0   0.001930    208.0   0.002080
EMCS619 .IMG file       126,755   190.0   0.001900    259.0   0.002590
ASCII data file         111,755   187.0   0.001870    226.0   0.002260
dBase III .DBF file     109,761   184.0   0.001340    225.0   0.002250
Text file                68,737   130.0   0.001300    189.0   0.001890
Lotus 123 .WKS file     143,727   210.0   0.002100    246.0   0.002460

TABLE II

Fixed Model Average Distribution

                        Output    Encode time (s)     Decode time (s)
File                    (bytes)   Total   per byte    Total   per byte
.EXE file               106,376   162.0   0.001620    262.0   0.002620
C source program         78,056   128.0   0.001280    194.0   0.001940
Multimate .DOC file      56,258   112.0   0.001120    146.0   0.001460
EMCS619 .IMG file        98,062   172.0   0.001720    237.0   0.002370
ASCII data file          68,846   131.0   0.001310    168.0   0.001680
dBase III .DBF file      67,555   129.0   0.001290    167.0   0.001670
Text file                72,969   138.0   0.001380    199.0   0.001990
Lotus 123 .WKS file      69,450   136.0   0.001360    162.0   0.001620

TABLE III

Fixed Model Image Distribution

                        Output    Encode time (s)     Decode time (s)
File                    (bytes)   Total   per byte    Total   per byte
.EXE file               131,041   206.0   0.002060    288.0   0.002880
C source program        108,047   164.0   0.001640    234.0   0.002340
Multimate .DOC file      84,206   156.0   0.001560    175.0   0.001750
EMCS619 .IMG file        79,534   145.0   0.001490    216.0   0.002160
ASCII data file         109,571   183.0   0.001830    213.0   0.002130
dBase III .DBF file     107,606   169.0   0.001690    212.0   0.002120
Text file               106,758   164.0   0.001640    236.0   0.002360
Lotus 123 .WKS file     101,103   163.0   0.001630    195.0   0.001950

TABLE IV

Adaptive Model

                        Output    Encode time (s)     Decode time (s)
File                    (bytes)   Total   per byte    Total   per byte
.EXE file                87,382   152.0   0.001520    183.0   0.001830
C source program         61,510   121.0   0.001210    134.0   0.001340
Multimate .DOC file      42,371    91.0   0.000910     90.0   0.000900
EMCS619 .IMG file        79,522   139.0   0.001390    151.0   0.001510
ASCII data file          52,696   103.0   0.001030     99.0   0.000990
dBase III .DBF file      50,976   100.0   0.001000     97.0   0.000970
Text file                56,209   100.0   0.001000    107.0   0.001070
Lotus 123 .WKS file      50,909    94.0   0.000940    104.0   0.001040

ABSTRACT

A VAX-based image processing system has been developed for the
digitization and analysis of the microvascular system in the rat
cremaster muscle. These are images of blood vessels which are less
than one millimeter in diameter. The purpose of this system is to
obtain quantitative morphometric data on the microvascular system
which cannot be easily obtained by manual methods. Animal studies
have shown that microcirculation can be used in the detection of
certain systemic vascular diseases such as diabetes mellitus and
hypertension. These diseases involve major disturbances in the
dimensions and the distributions of microvessels. A similar
phenomenon occurs with the introduction of substances such as
hormones into the system. The developed techniques will be used to
determine the blood vessel distributions for a number of samples.
Statistical testing will then be done on samples of images comprising
diseased and nondiseased animals, and on samples taken before and
after introduction of compounds, to determine which image component
parameters best discriminate diseased and nondiseased samples, and
best describe the effects of the compounds on the microvascular
system.

1.  INTRODUCTION 

The  Center  of  Applied  Microcirculatory 
Research  has  recently  been  established  at  the 
University  of  Louisville,  with  Dr.  P.  D.  Harris 
as  its  director.  The  primary  purpose  of  the 
Center  is  to  develop  microcirculation  medicine 
as  a  new  applied  discipline.  Microcirculation 
Medicine  is  a  new  clinical  arena  with  focus  on 
the  structure,  function,  pathology,  and  therapy 
of  blood  vessels  less  than  one  millimeter  in 
diameter.  Several  relevant  factors  entered  into 
the  creation  of  the  Center  at  the  University. 

Scientific literature documents that microvessels, at different
levels in the same organ, function in different ways for various
purposes [1,4,10,16]. Microvascular levels respond differently to
hormones, disease processes, and therapeutic agents and procedures
[1,5,6,11,12,13,14,16]. There has been a tremendous increase in
knowledge, techniques, and understanding of microcirculatory
mechanisms resulting from animal studies during the past 20 years,
and studies on animal models of human diseases have amply shown that
microvascular events play an important role in the development of
some of these disease processes [10].

Secondly, clinical sciences now use little of this expanding
microcirculatory knowledge [8,9]. The few approaches investigated,
such as observing the human microcirculation in specialized tissues
such as the conjunctiva of the eye and the nailfold of the fingers or
toes [2,7,9,15], have not been able to provide useful data on
microvascular function in humans. These approaches have not affected
the outcome of clinical medicine, with one dramatic exception, which
is described to demonstrate the importance of this research for
clinical medicine.

In the mid 1960's there was a severe epidemic of infectious
meningitis in China, with a 90% mortality rate for children under 2
years of age. The treatment was a Chinese herbal drug labeled "654,"
whose toxic level is only slightly higher than its effective
treatment level. Thus, many children treated subsequently died from
"654" toxicity. In 1965, a young Peking clinician, Dr. Rui-juan Xiu,
put together a simple bedside microscope to assess the "quality" of
blood perfusion in the nailfold capillaries of children. She used
this device to adjust the infusion rates of "654" to maintain an
effective but non-toxic therapeutic dose in each sick child. This
individualized control of "654" therapy reduced the infant mortality
rate in infectious meningitis to less than 10% within a three month
period.

The Center emphasizes multidisciplinary research teams, including
researchers from clinical medicine, basic health sciences, and
engineering. Collaborations had been developing between many
researchers in these fields, and this strong nucleus aided in the
development of the Center at the University of Louisville.

The Center has initially established six project areas in which to
concentrate its efforts. Several of these involve image processing
and pattern recognition, and are described briefly. Under certain
conditions, the venules exhibit a tendency to leak; hormones can
cause holes in the venules, or abnormalities of the small veins in
tumors can contribute to this leakage phenomenon. Using image
analysis of images of the microvascular system under these
conditions, the goals are to measure the amount of leakage, the
average rate of leakage, where the leakage is occurring, and the
effect of dosage on the leakage.

In the microvessels, white cells are in free flow. However, white
cells may stick to the vessel walls, roll along the vessel walls, and
clump to one another. This stickiness is an early sign of leukocyte
activation during conditions such as infection, tissue transplant
rejection, and systemic vascular diseases. Image analysis will be
used to study this stickiness phenomenon in terms of measuring the
tendency of the white cells to stick, their clumping tendency, how
rapidly the clumps grow, the number of clumps that pass a certain
point over time, and, if they stick to the wall, for how long. The
general methods developed here can also be used to study emboli and
thrombi.

Using nailfold images of the capillary system, image analysis will be
used to measure the velocity in the nailfold loops, based on plasma
gaps, determine the nailfold microcirculatory characteristics in
various disease categories, and identify image-analysis sequences for
quantification of desired microvascular parameters in nailfold
microcirculation.

2.  CURRENT  PROJECT 

The arterioles in the microvascular system are very dynamic in terms
of dilating or constricting. This has many causes: hormones,
therapeutic agents, and many systemic vascular diseases such as
hypertension and diabetes mellitus. It is this "diameter phenomenon"
which is the concern of this project. In general, the goals of this
project are to use image analysis techniques to measure the diameters
and lengths of the arterioles in a certain region in a tissue, and
determine sequential microvascular changes in these parameters from
various causes. Systemic vascular diseases involve major disturbances
in microvessels which range from 1 millimeter (the large arterioles)
down to 0.003 mm (the smallest arterioles). Animal studies have
suggested that detectable microvascular pathology appears very early
in the development of some forms of diabetes mellitus [2], have shown
pathology in the artery wall at an early stage in the development of
several experimental forms of hypertension [10], and have suggested
that treatment may reverse these microvascular disturbances, at least
in the early stages. Thus, animal studies suggest that observations
of microvascular changes in humans can provide very early detection
of some forms of systemic vascular disease, as well as assessments of
the efficacy of early therapeutic interventions.

Presently, microcirculation in the nailfold has been studied, and
provides useful information on capillary perfusion [7]. However, this
system does not visualize the arterioles and venules which are
involved early in systemic vascular disease. Studies have also
examined the microcirculation in the conjunctiva (white of the eye).
However, this has provided data only for a sparsely arranged vessel
network, with little parallelism between large arterioles and
venules, whereas the general systemic microcirculation contains
parallel and adjacent large arterioles and venules. Thus, a new
method for microvascular observation that is truly representative of
the general microcirculation is needed in humans to detect and to
assess the treatment of systemic vascular disease.

Currently, this project involves image analysis of the microvascular
system for a region of the rat cremaster muscle. This thin muscle
tissue allows transmitted light microscopy and epi-illumination
fluorescent microscopy simultaneously. A closed-circuit television
system, as shown in Figure 1, can be used to obtain video images of
the microvascular system. A new gingival preparation [3] for
observation of the microcirculation in the gum and lip area of the
mouth is also being developed for use in humans. The gingival
microcirculation has arterioles, capillaries, and venules which can
also be observed by microscopy, and this circulation is typical of
many other body tissues. Thus, it is envisioned that when the image
analysis research is completed on the rat cremaster muscle, it can be
extended to humans using the gingival preparation.


Figure 1: Optical System for obtaining images of microvascular
system in the rat cremaster muscle.

3.  METHODOLOGY

Videotapes  are  supplied  by  the  basic  health 
scientists  working  with  the  Center  showing  the 
microvascular  system  of  the  rat  cremaster  muscle 
for  normal  rats,  rats  bred  for  systemic  diseases 
such  as  hypertension,  and  for  the  microvascular 
system  before  and  after  application  of  a  sub¬ 
stance  such  as  a  hormone.  Images  from  the 
videotapes  are  obtained  using  the  VAX  Vision 
System.  The  hardware  component  of  the  VAX 
Vision  System  is  the  ITI-IP-512  digital  image 
processing  system,  whose  fundamental  components 
are  a  video  digitizer  and  a  frame  buffer.  The 
frame buffer contains 256K bytes of high-speed
random access memory, where 512 x 480 pixel image
frames are stored. Individual pixels are 8 bits
deep, allowing for 256 gray levels. With this
system,  a  standard  RS-170  video  signal  can  be 
digitized,  stored,  and  displayed  on  a  video 
monitor  in  real  time.  The  software  of  the  VAX 
Vision  System  consists  of  VISION-SUBS,  a  series 
of  subroutines  for  controlling  the  ITI  vision 

hardware;  VAXTIPS,  an  interactive  image  manipu¬ 
lation  system;  and  VISCOM,  a  stand-alone  program 
which  must  be  run  whenever  the  VMS  operating 
system  is  booted. 

Using  the  VAX  Vision  System,  a  frame  from 
the videotape is grabbed, digitized, and stored
for  further  processing.  A  typical  image  is 
shown  in  Figure  2.  The  parameters  of  interest 
for  this  project  are  the  arteriole  lengths  and 
diameters  in  this  region  of  the  rat  cremaster 
muscle.  To  obtain  the  measurements  of  interest, 
standard  thresholding  techniques  were  first 
applied  to  better  define  the  vessels.  However, 
problems  arose  with  this  approach.  The  image 
contrast  is  not  very  good,  and  more  importantly, 
no  automatic  procedure  could  be  developed  to 
distinguish the arterioles (which are the
vessels being studied) from the venules (which
are not wanted in this project). Currently, the
approach is to use manual cursor movement to
mark points along the edges of the arterioles of
interest. These points are then connected,
creating an outline of the arterioles. A binary
image with the arterioles filled in is then
produced, to which a thinning algorithm is applied
to obtain a skeleton of each arteriole.
Starting  with  an  original  image  as  in  Figure  2, 


Figure 2: Original image of microvascular system in a region of the rat
cremaster muscle.

Figure 3: Image after marking and connecting of arteriole wall.

Figure 5: Skeleton image of vessel after thinning.

In pseudocode, the filling and thinning algorithms can be summarized as
follows:
FILL (NUMBER OF ROWS, NUMBER OF COLUMNS, OBJECT COLOR)

begin
   repeat for each pixel in IMAGE, row by row and column by column
      if IMAGE(ROW, COLUMN) = white
         if surrounding pixels fit corner pattern (See Fig. 6)
            SEED ROW <- row of pixel inside corner
            SEED COLUMN <- column of pixel inside corner
            exit repeat loop
         else
            skip to beginning of next row
         end if
      end if
   end repeat

   call CONNECT (SEED ROW, SEED COLUMN, LABEL TAG, NUMBER OF ROWS,
                 NUMBER OF COLUMNS)

   set every LABELed pixel to white
end


CONNECT (ROW, COLUMN, LABEL TAG, NUMBER OF ROWS, NUMBER OF COLUMNS)

begin
   if ROW and COLUMN are within image limits
      if IMAGE(ROW, COLUMN) = black
         set corresponding pixel on screen to white
         IMAGE(ROW, COLUMN) <- LABEL TAG
         call CONNECT with the coordinates of each of the four
            neighbor pixels (See Fig. 7); the other parameters
            stay the same
      end if
   end if
end
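The CONNECT step above is a four-neighbor flood fill. A minimal Python sketch of the same idea follows. It is illustrative only: the pixel values `BLACK` and `WHITE`, the list-of-lists image layout, and the function names are assumptions, and an explicit queue replaces the recursive calls of the pseudocode so that large regions cannot exhaust the call stack.

```python
from collections import deque

BLACK = 0    # assumed value of unfilled interior pixels
WHITE = 255  # assumed value of outline pixels


def connect(image, seed_row, seed_col, label_tag):
    """Label every black pixel 4-connected to the seed; return how many."""
    n_rows, n_cols = len(image), len(image[0])
    pending = deque([(seed_row, seed_col)])
    labeled = 0
    while pending:
        row, col = pending.popleft()
        if not (0 <= row < n_rows and 0 <= col < n_cols):
            continue  # outside the image limits
        if image[row][col] != BLACK:
            continue  # outline pixel, or already labeled
        image[row][col] = label_tag
        labeled += 1
        # The four neighbors of Figure 7: up, down, left, right.
        pending.extend([(row - 1, col), (row + 1, col),
                        (row, col - 1), (row, col + 1)])
    return labeled


def fill(image, seed_row, seed_col, label_tag=1):
    """Fill the outlined region containing the seed with white."""
    connect(image, seed_row, seed_col, label_tag)
    for row in image:
        for col, value in enumerate(row):
            if value == label_tag:
                row[col] = WHITE
```

The label tag must differ from both pixel values so that labeled pixels are not visited twice. The search for the seed pixel by corner pattern is not reproduced here; the seed is simply passed in.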


Figure 6: Corner patterns in fill algorithm

Figure 7: Four neighbors used in CONNECT

THIN2OBJ (IMAGE, NUMBER OF ROWS, NUMBER OF COLUMNS)

begin
   display image

   do while (edge pixels continue to be deleted)

      repeat for each pixel in IMAGE, row by row and column by column
         if IMAGE(ROW, COLUMN) ≠ 0
            DIRECTION <- 0 (See Fig. 8)

            do while (initial coordinates are not reached again)
               if there are 2 to 6 white neighbors in the 8 neighbors
                  if there is exactly one black to white transition
                     in the neighbors
                     if pixel is a right or bottom associated pixel
                        mark IMAGE(ROW, COLUMN) to be deleted
                     end if
                  end if
               end if

               EDGE ROW <- next edge pixel row coordinate
               EDGE COLUMN <- next edge pixel column coordinate
            end do while

            exit repeat loop
         end if
      end repeat

      repeat for each pixel in IMAGE
         if IMAGE(ROW, COLUMN) is marked to be deleted
            IMAGE(ROW, COLUMN) <- 0 (black)
         end if
      end repeat

      repeat both above repeats for left and top associated pixels

      display image
   end do while
end
Figure 8: Direction labels in thinning algorithm

At  the  skeleton  stage,  the  segments  of  the 
arterioles  need  to  be  defined.  A  measuring 
procedure  will  then  find  the  length  and  width 
for  each  segment  (or  branch)  of  the  arterioles 
of  the  original  image.  A  segment  begins  at  an 
end  point  and  ends  at  another  end  point  or  at  a 
branching point. An end point is a pixel with
exactly  one  neighbor.  The  measure  procedure 
searches  the  skeleton  image  from  the  upper  left 
corner for the first end point. The skeleton is
then  followed  until  another  end  point  or  a 
branching  point  is  reached.  A  pixel  is 
considered  a  branching  point  if  the  skeleton 
forks  or  changes  direction  significantly.  A 
fork  is  indicated  by  a  white  to  black 
transition for each branch at the fork in the
surrounding  neighbor  pixels.  A  direction  change 
is  considered  significant  if  the  difference  in 
horizontal  or  vertical  coordinates  from  one 
pixel  to  the  next  changes  in  sign.  For  example, 
if  the  horizontal  coordinate  of  pixels  along  the 
skeleton's  path  has  been  increasing  and  suddenly 
starts  decreasing,  the  pixel  where  the  change 
occurs  is  considered  a  branching  point.  This 
latter  type  of  branching  point  check  is  needed 
because  a  fork  branching  point  only  exists  when 
a single segment branches into two segments; if
the  skeleton  has  only  two  branches  (a  V  shape), 
a  fork  branch  does  not  exist.  As  a  segment  is 
followed,  the  pixels  passed  through  are  deleted 
from  the  skeleton  image  array.  Any  white 
neighbor  pixels  to  a  branching  point  pixel  are 
also  deleted.  These  deletions  are  necessary 
since  the  check  for  the  next  segment  also  starts 
from  the  upper  left  corner.  The  deletions  at 
branching  points  separate  the  segments 
originating at forks, ensuring an end point at
the  beginning  of  each  branch.  The  previously 
detected  branch  will  not  be  considered  since  it 
has  been  deleted. 

Once a branch has been defined,
measurements are made on it if it has a length
of more than ten pixels. The length of a
segment is given by the number of pixels in the
skeleton for that segment. If the length of the
branch is between eleven and twenty pixels, a
width (diameter) measurement is taken at the
midpoint of the branch. The midpoint is the
pixel one half the length along the skeleton from
the branch start point. All of the branches
longer than 20 pixels have width measurements
taken every 15 pixels, starting 15 pixels from
the start point and ending 15 pixels from the
end point. If there are not 15 pixels between
the last measurement and the point 15 pixels
from the end of the branch, a measurement is
still taken at the latter point if there is a
difference of at least 6 pixels along the
skeleton from the last measurement.
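The sampling rule above can be sketched as a small Python function that returns the offsets (in pixels along the skeleton, counted from the branch start point) at which widths would be measured. The function name and parameters are hypothetical. The text does not say what happens for branches of 21 to 29 pixels, where the point 15 pixels from the end lies before the point 15 pixels from the start; the sketch assumes a single midpoint measurement in that case.

```python
def measurement_offsets(length, step=15, min_gap=6):
    """Offsets along a branch skeleton where width is measured."""
    if length <= 10:
        return []                      # too short: no measurements
    if length <= 20:
        return [length // 2]           # one measurement at the midpoint
    last_allowed = length - step       # point 15 pixels from the end
    offsets = list(range(step, last_allowed + 1, step))
    if not offsets:
        # Assumption, not stated in the text: branches of 21-29 pixels
        # fall back to a single midpoint measurement.
        return [length // 2]
    # Extra measurement 15 pixels from the end if it is at least
    # min_gap pixels beyond the last regular measurement.
    if last_allowed - offsets[-1] >= min_gap:
        offsets.append(last_allowed)
    return offsets
```

For example, a 70-pixel branch is measured at offsets 15, 30 and 45, plus 55 because that point is 10 pixels past the last regular measurement; a 64-pixel branch stops at 45 because the remaining gap is only 4 pixels.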

The  width  measurement  is  defined  as  the 
minimum  of  the  horizontal,  vertical,  positive 
and  negative  45  degree,  positive  and  negative 
22.5  degree,  and  positive  and  negative  67.5 
degree  bisectors  of  the  filled  image  through  the 
coordinates  of  the  measuring  point.  Each 
measurement is made by finding the length, in
pixels, of a straight line between the first
background  points  outside  the  filled  image  in 
opposite  directions  along  the  bisector  from  the 
skeleton  point  on  the  bisector.  Each  pixel 
along  the  bisector  is  tested  for  the  background 
value.  The  positive  and  negative  45  degree 
points  are  found  by  moving  one  pixel 
horizontally  and  one  pixel  vertically  until  a 
black  background  pixel  is  reached.  The  positive 
and  negative  22.5  degree  points  are  approximated 
by  moving  in  units  of  three  pixels  horizontally, 
one  pixel  vertically,  two  pixels  horizontally, 
and  then  one  pixel  vertically.  The  resulting 
bisectors  are  actually  at  21.8  and  -21.8 
degrees,  but  they  are  the  closest  approximations 
to  the  listed  angles  possible  using  a  fairly 
small  number  of  pixels.  The  positive  and 
negative  67.5  degree  bisectors  are  a  similar 
approximation,  whose  units  of  three  pixels 
vertically,  one  pixel  horizontally,  two  pixels 
vertically,  and  one  pixel  horizontally  produce 
bisectors  actually  at  68.2  and  -68.2  degrees. 
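The width measurement described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' implementation: the filled image is taken to be a list of rows with background value `BLACK = 0`, pixels beyond the image border are treated as background, and all names are hypothetical. Each bisector is encoded as a repeating pattern of unit moves, so the 22.5 degree pattern (three horizontal, one vertical, two horizontal, one vertical) advances 5 columns per 2 rows, which gives the 21.8 degrees quoted in the text.

```python
import math
from itertools import cycle

BLACK = 0  # assumed background value of the filled image

H = (0, 1)  # unit move: one pixel horizontally (row offset, column offset)
V = (1, 0)  # unit move: one pixel vertically


def _stepped(major, minor, sign):
    """Three major moves, one minor, two major, one minor."""
    m = (minor[0] * sign, minor[1] * sign)
    return [major] * 3 + [m] + [major] * 2 + [m]


BISECTORS = [
    [H], [V],                                   # horizontal, vertical
    [(1, 1)], [(1, -1)],                        # +/- 45 degrees
    _stepped(H, V, 1), _stepped(H, V, -1),      # about +/- 22.5 degrees
    _stepped(V, H, 1), _stepped(V, H, -1),      # about +/- 67.5 degrees
]


def _first_background(image, row, col, pattern):
    """Walk along the pattern until a background pixel is reached."""
    n_rows, n_cols = len(image), len(image[0])
    for d_row, d_col in cycle(pattern):
        row, col = row + d_row, col + d_col
        outside = not (0 <= row < n_rows and 0 <= col < n_cols)
        if outside or image[row][col] == BLACK:
            return row, col


def width_at(image, row, col):
    """Minimum bisector length through the skeleton point (row, col)."""
    best = math.inf
    for pattern in BISECTORS:
        r1, c1 = _first_background(image, row, col, pattern)
        reverse = [(-dr, -dc) for dr, dc in pattern]
        r2, c2 = _first_background(image, row, col, reverse)
        best = min(best, math.hypot(r1 - r2, c1 - c2))
    return best
```

Because the length is taken between the first background pixels on either side, a vessel that is five pixels thick yields a width of 6 under this definition.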

All  width  measurements  for  an  image  and  the 
number  of  measurements  with  a  particular  width, 
regardless  of  location,  are  stored  in  a  linked 
list.  A  graphical  routine  then  uses  this  list 
to  display  a  plot  of  the  length  versus  width 
measurements. 


4.  DISCUSSION  AND  RESULTS 

The  algorithms  described  above  to  obtain 
length  and  diameter  measurements  of  arterioles 
in  the  microvascular  system  of  a  region  of  the 
rat  cremaster  muscle  have  been  implemented  and 
are  working.  The  graphical  procedure  has  been 
implemented  to  obtain  distributions  of  the  total 
length  of  segments  categorized  by  the  diameter 
of the segment. While currently this entire
procedure  has  been  performed  on  a  few  select 
samples,  the  plan  is  to  repeat  this  process  for 
many  image  samples,  including  normal  rats, 
certain  diseased  rats,  and  for  before  and  after 
application  of  particular  substances  such  as 
hormones.  The  changes  in  these  distributions 
will  then  be  analyzed.  For  example,  one 
conjecture  is  that  for  a  low  dose  of  a  hormone, 
the  smaller  diameter  arterioles  alone  constrict, 
and  the  total  length/diameter  distribution  will 
show  a  shift  at  the  low  diameter  range  only.  As 
the dosage is increased, the range of arteriole
diameters affected increases. The long-range
goal  is  to  correlate  the  shift/change  in  the 
total  length/diameter  distribution  with  an 
effective  dosage  level. 
For hypertension, animal studies show
pathology in the artery wall at an early stage
in the development of several experimental forms
of hypertension [10]. In hypertension, only
large arterioles appear to be involved during
the very early phase, and smaller arterioles are
progressively involved at a later stage. Animal
studies have also suggested that detectable
microvascular pathology appears very early in
the development of some, and maybe all, forms of
diabetes mellitus [2]. The goal of this project
is  to  use  these  changes  in  length/diameter 
distributions  to  monitor  the  progression  of 
these  diseases  in  humans,  as  well  as  to  assess 
the  efficacy  of  early  therapeutic  interventions. 

In summary, this project uses image
analysis techniques to obtain parameters
(length, diameter) of the microvascular system
in the rat cremaster muscle, with the goal
being to correlate the changes in the
length/diameter distributions (which classes of
arterioles change size, and how much) with the
dosage of a particular compound or the
progression of some systemic vascular diseases.
The techniques to obtain the needed parameter
estimates have been developed, and soon various
sample distributions will be obtained, analyzed,
and compared. The gingival preparation will
then be used to extend these techniques to
humans.
AN EMPIRICAL BAYES DECISION RULE OF TWO-CLASS PATTERN RECOGNITION


Tze Fen Li and Dinesh S. Bhoj, Rutgers University at Camden


Abstract 

In  the  pattern  classification  problem,  it  is  known 
that  the  Bayes  decision  rule,  which  separates  two 
classes  of  patterns  gives  a  minimum  probability  of 
misclassification.  In  this  study,  the  conditional  den¬ 
sity  functions  are  known,  but  the  prior  probability  of 
each  class  is  unknown.  A  set  of  past  observations  (or 
a  training  set)  of  unknown  classes  is  used  to  estimate 
the  unknown  true  prior  probability  and  hence  is  used 
to  construct  an  empirical  Bayes  decision  rule,  which 
separates two classes and which can make the probability
of misclassification arbitrarily close to that of
the  Bayes  rule.  The  results  of  a  Monte  Carlo  simula¬ 
tion  study  are  presented  to  demonstrate  the  favorable 
prior  estimation  and  the  classification  performed  by 
the  empirical  Bayes  decision  rule. 

Key  words  and  phrases:  classification,  empirical  Bayes, 
pattern  recognition. 


1.  Introduction 

Essentially,  there  are  two  different  approaches 
to  solving  classification  problems.  One  approach  is  to 
find  a  Bayes  decision  rule,  which  separates  two  classes 
based  on  the  present  observation  X  and  minimizes  the 
probability of misclassification [3,7]. This approach
requires sufficient information about the conditional
density function $f(x \mid \omega)$ of $X$ given class $\omega$ and the
prior probability $p(\omega)$ of each class; otherwise, the
conditional  density  and  the  prior  probability  have  to 
be  estimated  through  a  set  of  past  observations  (or  a 
training  set  of  sample  patterns  with  known  classes). 
On  the  other  hand,  if  very  little  is  known  about  statis¬ 
tical  properties  of  the  pattern  classes,  a  discriminant 
function can be used. A learning
automaton and an algorithm are designed to learn the parameters of the
discriminant function. After learning, this function
is used to separate pattern classes [1,2,5,6]. For this
approach, it is not easy to define the functional form
and  its  parameters.  Moreover,  the  discriminant  func¬ 
tion  after  learning  will  not  be  able  to  give  the  min¬ 
imum  probability  of  misclassification.  In  this  study, 
the  first  approach  is  applied  to  solving  two-class  pat¬ 
tern  problem. 

The conditional density functions $f(x \mid \omega)$ are
known,  but  the  prior  probability  of  each  class  is  un¬ 
known.  A  set  of  n  past  observations  of  unknown 
classes  is  used  to  estimate  the  true  unknown  prior 
probability  and  this  is  used  to  construct  a  decision 
rule,  called  an  empirical  Bayes  (EB)  rule  [4],  which 
is used to separate two classes. It will make the probability
of misclassification arbitrarily close to that of
the  Bayes  rule.  The  results  of  a  Monte  Carlo  simu¬ 
lation  study  with  one-dimensional  distributions  are 
presented to demonstrate the favorable estimation
of the unknown prior probability and the boundary
point  of  two  classes  made  by  the  EB  decision  rule. 

2. Classification Of Two Classes

Let $X$ be the present observation which belongs
to one of two classes $c_1$ and $c_2$. Consider the decision
problem consisting of determining whether $X$ belongs
to $c_1$ or $c_2$. Let $f(x \mid \omega)$ be the conditional density
function, where $\omega = c_1$ denotes class 1 and $\omega = c_2$
denotes class 2. Let $\theta$ be the prior probability of
$\omega = c_1$. Let $d$ be a decision rule. A simple loss
function is used such that the loss is 1 when $d$ makes
a wrong decision and the loss is 0 when $d$ makes a
right decision. Let $R(\theta, d)$ denote the risk function
(the probability of misclassification) of $d$. Let $L$ and $U$
be two regions separated at a point $z$ by the decision
rule $d$ in the domain of $X$, i.e., $d$ decides $c_1$ when
$X \in L$ and decides $c_2$ when $X \in U$. Then

$$R(\theta, d) = \int_U \theta f(x \mid c_1)\,dx + \int_L (1 - \theta) f(x \mid c_2)\,dx. \qquad (1)$$

Let $D$ be the family of all decision rules which separate
two classes. For $\theta$ fixed, let the minimum probability
of misclassification be denoted by

$$R(\theta) = \inf_{d \in D} R(\theta, d). \qquad (2)$$

A decision rule $d_\theta$ which satisfies (2) is called the
Bayes decision rule with respect to the prior $\theta$ and is
given by

$$d_\theta(x) = \begin{cases} c_1 & \text{if } \theta f(x \mid c_1) > (1 - \theta) f(x \mid c_2) \\ c_2 & \text{otherwise.} \end{cases} \qquad (3)$$

In the empirical Bayes (EB) decision problem [4],
the past observations $X_m$, $m = 1, 2, \ldots, n$, and the
present observation $X$ are i.i.d. The EB decision
problem is to establish a decision rule based on the
set of past observations $\mathbf{X}_n = (X_1, \ldots, X_n)$. This can
be constructed as using $\mathbf{X}_n$ to select a decision rule
$t(\mathbf{X}_n)$ which determines whether the present observation
$X$ belongs to $c_1$ or $c_2$. Let $p(x_m \mid \theta)$ be the
marginal density of $X_m$ with respect to the prior distribution
of classes, i.e.,

$$p(x_m \mid \theta) = \theta f(x_m \mid c_1) + (1 - \theta) f(x_m \mid c_2).$$

We divide the interval [0,1] into $k$ subintervals
and a finite discrete distribution $\phi$ is placed on
$\theta \in [0,1]$ such that $\phi(\theta = \theta_i) = 1/k$, where $\theta_i$ is the
middle point of the $i$-th subinterval, $i = 1, \ldots, k$.
Let $h(\theta_i \mid \mathbf{x}_n)$ be defined by

$$h(\theta_i \mid \mathbf{x}_n) = \frac{\prod_{m=1}^{n} p(x_m \mid \theta_i)}{\sum_{l=1}^{k} \prod_{m=1}^{n} p(x_m \mid \theta_l)}, \qquad i = 1, \ldots, k, \qquad (4)$$

which is the conditional probability of $\theta_i$ given $\mathbf{X}_n = \mathbf{x}_n$.
The conditional expectation $E[\theta \mid \mathbf{X}_n]$ was shown
[8] to converge a.s. to a point $\theta' \in [0,1]$ with $|\theta' - \mu| < 1/k$
with respect to the true prior probability $p(\omega = c_1) = \mu$.
Our EB decision rule is obtained by replacing
the unknown $\theta$ in (3) by $E[\theta \mid \mathbf{X}_n]$ and is written as

$$t(\mathbf{X}_n)(X) = \begin{cases} c_1 & \text{if } \dfrac{f(X \mid c_1)}{f(X \mid c_2)} > \dfrac{1}{E[\theta \mid \mathbf{X}_n]} - 1 \\ c_2 & \text{otherwise.} \end{cases} \qquad (5)$$
The EB decision rule (5) is used to separate two
classes  and  the  simulation  results  will  be  presented 
in  the  next  section. 
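As an illustration of how the estimate of the prior used in rule (5) might be computed, the following Python sketch evaluates the posterior over the $k$ subinterval midpoints and returns its mean. It is a sketch under stated assumptions rather than the authors' program: the function names, the choice $k = 100$, the equal weights on the midpoints, and the simulation helper are all hypothetical. Products of densities are accumulated as sums of logarithms to avoid numerical underflow for several hundred observations.

```python
import math
import random


def normal_pdf(x, mean, sd=1.0):
    """Density of a normal distribution with the given mean and sd."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def eb_prior_estimate(observations, f1, f2, k=100):
    """Posterior mean of theta over the k subinterval midpoints.

    f1 and f2 are the known class-conditional densities; the
    observations are unlabeled.
    """
    thetas = [(i + 0.5) / k for i in range(k)]
    # Log of the product of marginal densities for each candidate theta.
    log_like = [
        sum(math.log(t * f1(x) + (1.0 - t) * f2(x)) for x in observations)
        for t in thetas
    ]
    peak = max(log_like)                        # subtract for stability
    weights = [math.exp(v - peak) for v in log_like]
    total = sum(weights)
    return sum(t * w for t, w in zip(thetas, weights)) / total


def simulate(mu, n, rng):
    """Draw n unlabeled observations: N(0,1) w.p. mu, else N(2,1)."""
    return [rng.gauss(0.0, 1.0) if rng.random() < mu else rng.gauss(2.0, 1.0)
            for _ in range(n)]
```

A call such as `eb_prior_estimate(simulate(0.3, 600, random.Random(0)), lambda x: normal_pdf(x, 0.0), lambda x: normal_pdf(x, 2.0))` returns an estimate of the prior that can be substituted for $\theta$ in rule (3).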

3.  Simulation  Results 

In this section, we generate a set of observations
(a training set) $\mathbf{X}_n$, which is used to estimate
the true prior probability $\mu$ of class 1 and to establish
an empirical Bayes rule for each of three cases.
Each EB rule will determine a boundary point of
two classes. A normal distribution and a uniform
distribution are used as the conditional density
function $f(x \mid \omega)$:

            Normal    Uniform
Class 1:    N(0,1)    U(0.5)
Class 2:    N(2,1)    U(1.0)

The prior distribution is unknown. For the normal
distribution, 400, 500 and 600 observations are generated
on an IBM-PC microcomputer and the proportion
of observations from class 1 is $\mu = 0.3$, 0.5 and
0.7. The simulation results are given in Table 1. A
set of 600 observations gives a satisfactory estimate
$E[\theta \mid \mathbf{X}_n]$ of the true prior probability $\mu$:
0.3003, 0.4995 and 0.6925, respectively. The boundary
points of two classes determined by the Bayes rule
and the EB rule are given in Table 2. Table 2
shows that the boundary points provided by the EB rule
are close to those of the Bayes rule.
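For two normal classes with equal variance, the boundary point follows in closed form by setting $\theta f(x \mid c_1) = (1 - \theta) f(x \mid c_2)$ and solving for $x$. The Python sketch below does this; the function name and default arguments are illustrative assumptions. With means 0 and 2 and unit variance the boundary is $1 + \tfrac{1}{2}\ln(\theta / (1 - \theta))$, which agrees with the Bayes and EB entries of Table 2 to about three decimal places.

```python
import math


def bayes_boundary(theta, mean1=0.0, mean2=2.0, sd=1.0):
    """Boundary between two equal-variance normal classes.

    Solves theta * f(x | c1) = (1 - theta) * f(x | c2) for x, where
    theta is the prior probability of class 1.
    """
    if not 0.0 < theta < 1.0:
        raise ValueError("theta must lie strictly between 0 and 1")
    midpoint = 0.5 * (mean1 + mean2)
    shift = sd * sd / (mean2 - mean1) * math.log(theta / (1.0 - theta))
    return midpoint + shift


for prior in (0.3, 0.5, 0.7):
    print(prior, round(bayes_boundary(prior), 4))
```

Passing an estimated prior, for example `bayes_boundary(0.3003)`, gives the corresponding EB boundary point.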

Table 1. The estimation $E[\theta \mid \mathbf{X}_n]$ of the true prior. Normal
distributions with means 0 and 2 and equal variance 1. Uniform
distributions with means 0.5 and 1.0.

True prior   No. of           E[θ | X_n]
μ            observations     Normal     Uniform

0.3          400              0.3002     0.3001
0.3          500              0.3002     0.3005
0.3          600              0.3003     0.3001
0.5          400              0.4680     0.4976
0.5          500              0.4950     0.4993
0.5          600              0.4995     0.4993
0.7          400              0.6204     0.6948
0.7          500              0.6662     0.6955
0.7          600              0.6925     0.6962

For the uniform distribution, 400, 500 and 600 observations
are generated for $\mu = 0.3$, 0.5 and 0.7. The simulation
results are also given in Table 1. The set of 600 observations
gives a satisfactory estimate of the true prior $\mu$:
0.3001, 0.4993 and 0.6962, respectively. The boundary
points of two classes are given in Table 3. Table 3 shows that
the boundary point determined by the EB rule is the same
as that of the Bayes rule.


Table 2. The boundary points of two classes given by
the Bayes rule and the EB rules for normal distributions.

True prior μ    Bayes rule    EB rule

0.3             0.5764        0.5771
0.5             1.0000        0.9990
0.7             1.4237        1.4060

Table 3. The boundary points of two classes given by
the Bayes rule and the EB rules for uniform distributions.

True prior μ    Bayes rule       EB rule

0.3             0.5000           0.5000
0.5             0.5000-1.0000    0.5000-1.0000
0.7             1.0000           1.0000

Note: 0.5000-1.0000 means that the boundary point can be
anywhere between 0.5 and 1.
STATISTICAL MODELING OF A PRIORI INFORMATION FOR IMAGE PROCESSING PROBLEMS:

A  Mathematical  Expression  of  Images 

Z.  Liang,  Duke  University  Medical  Center 


ABSTRACT 

A general mathematical expression of images is presented,
intended to reflect the intrinsic probabilistic information of
image density distribution, in terms of a priori image (or source)
probability density functions. It strongly resembles the
entropy form defined by Kullback and Leibler and has the
defined contents of a priori source distribution probabilistic
information. The expression reduces to the form of Shannon's
entropy if a uniform a priori source probability distribution is
assumed. A Bayesian analysis incorporating the a priori source
probabilistic information is studied in treating observed data
obeying Poisson statistics. A system of equations determining
the Bayesian solution is given which maximizes the a posteriori
probability given the observed data. A Bayesian imaging algorithm
approaching the solution iteratively is derived by
employing an expectation maximization technique. Tests of the
Bayesian algorithm with uniform and non-uniform a priori
probabilities are carried out for computer generated ideal data
and experimental phantom imaging data containing Poisson
noise. Good quality images are obtained. Preliminary study of
maximizing the a priori source distribution probabilistic information
is also presented.

INTRODUCTION 

Statistical modeling of ill-posed inverse image processing
problems [1] has been enhanced in recent years
by use of the maximum entropy (ME) [2-3] and the maximum
likelihood (ML) [4-5] analysis. Although some effort has been
made to consider both the source entropy and data likelihood
information [6-7], statistical modeling of the image processing
problems has not yet been extensively investigated. For that
purpose, a statistical model of a priori source distribution
probabilistic information has been proposed, intended to reflect the
intrinsic probabilistic information of source distribution [8].
The model considers the statistical behavior of individual source
elements and incorporates the a priori source information via
maximum entropy analysis. This statistical model has now
been developed to consider two general classes of a priori source
probabilistic information consistent with the random process
of source (or image) density distribution: (a) uniform and (b)
non-uniform a priori image probability distributions.
Mathematical expressions of the statistical model of images
containing the uniform and non-uniform a priori probabilistic
information are formulated. The image expressions strongly
resemble the entropy forms defined respectively by Shannon [9]
and Kullback et al. [10] and have the defined content of a priori
image density probability distribution. These formulas of a
priori source distribution probabilistic information imply a
principle of maximum a priori probability (PMAPP), of which
Jaynes' maximum entropy principle (MEP) [11] may be a
special case. A Bayesian analysis incorporating the a priori
source probabilistic information, where the data likelihood
(probability) function is assumed to reflect the Poisson statistics
of photon detection, is studied. Other likelihood functions of
uncorrelated and correlated Gaussian data are given in Appendix A.
A system of equations determining the Bayesian solution
is derived which maximizes the a posteriori probability
given the observed data. A Bayesian imaging algorithm
approaching the solution iteratively is derived by employing,
among many other iterative schemes [12-13], the expectation
maximization (EM) technique [14]. As a simple example rather
than using the EM technique, the steepest descent method [12]
is used for the Bayesian solution as shown in Appendix B.
Preliminary studies of maximizing the a priori source probability
using the Lagrange parameter technique [6] and the recursive
Picard method [15] are given respectively in Appendices C and
D. Tests of the EM Bayesian algorithm and other iterative
algorithms derived in the Appendices with the uniform and
non-uniform a priori source probabilistic information are carried out
for computer generated ideal data and experimental phantom
imaging data containing the Poisson noise. Good quality images
are obtained. A filtering criterion function is used to quantitatively
indicate the convergence performance of the iterative
Bayesian algorithm.

A  PRIORI  INFORMATION  FUNCTIONS 

The source distribution region is, as usual in digital image
processing, divided into $J$ source elements (or voxels). Each
voxel has an average value over its volume, $\phi_j$,
$j = 1, 2, \ldots, J$. In nuclear isotope imaging, $\phi_j$ stands for the
gamma photon emission from voxel $j$ per unit time at time 0;
in X-ray imaging, it represents the attenuation density of voxel
$j$; in optical picture processing, it is the radiance value of
voxel $j$; in scanning electron microscopes, it is the transmittance
of voxel $j$; and in NMR imaging, it reflects the intensity
of voxel $j$ in the spectrum space. In the following sections,
$\phi_j$ is referred to generally as the strength or density of voxel
$j$.

If the source strengths $\{\phi_j\}$ are hypothetically quantized
into strength units (or photon "balls"), then $\phi_j$ represents the
number of the strength units, or the photon balls. If the total
number of strength units $N = \sum_j \phi_j$ can be assumed to be
fixed, the source strength distribution can then be characterized
as a random process in which the $N$ strength units distribute
randomly over the $J$ voxels. Let $p_j(\phi_j)$ represent the a
priori probability of a strength unit falling into voxel $j$; the a
priori source distribution probability is then expressed as [8]

$$P(\Phi) = N! \prod_{j=1}^{J} \frac{[p_j(\phi_j)]^{\phi_j}}{\phi_j!}. \qquad (1)$$

The $P(\Phi)$ of function (1) is the a priori source probability
function. It reflects a statistical random process of image density
distribution considering a priori probability distribution
information $\{p_j(\phi_j)\}$. The underlying assumptions of function
(1) are very general and function (1) can be applicable in
many image processing problems.

By the definition of probability (i.e., with total $N$ density
units, there are $\phi_j$ units falling into voxel $j$),
it is assumed that the a priori probability may be approximated
as:

$$p_j(\phi_j) = \bar{\phi}_j / N \qquad (2)$$

where $\bar{\phi}_j$ represents the a priori mean strength of voxel $j$.
The larger the value $N$, the higher the information content of
data statistics, and so the better this approximation becomes. The
estimation of $\bar{\phi}_j$ is quite important for the optimal solution of
$\{\phi_j\}$ given the observed data. This will be discussed later.

If a maximum a priori probability principle applies, then
maximizing the probability function $P(\Phi)$ is equivalent to
maximizing the log function:

$$H(\Phi) = \ln P(\Phi) = \ln(N!) + \sum_j \left[ \phi_j \ln p_j(\phi_j) - \ln(\phi_j!) \right]$$

$$= \ln(N!) + \sum_j \left[ \phi_j \ln(\bar{\phi}_j / N) - \ln(\phi_j!) \right]. \qquad (3)$$

Using Stirling's formula

$$\ln(N!) \approx N \ln(N) - N$$

and the constraint $N = \sum_j \phi_j$, function (3) becomes:

$$H(\Phi) = -\sum_j \phi_j \ln(\phi_j / \bar{\phi}_j). \qquad (4)$$

Function (4) is the general mathematical expression of
images containing the a priori image density distribution
probabilistic information under the principle of maximum a priori
probability. It strongly resembles the entropy form defined by
Kullback and Leibler [10] and has the defined contents of image
density $\{\phi_j\}$ and the a priori mean information $\{\bar{\phi}_j\}$.

If the a priori probability distribution $\{p_j(\phi_j)\}$ is uniform,
i.e., $\bar{\phi}_j = \bar{\phi} = N/J$, function (4) reduces to:

$$H(\Phi) = -\sum_j \phi_j \ln \phi_j + N \ln(N/J). \qquad (5)$$
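The step from the multinomial form (1) to expressions (4) and (5) can be checked numerically. The Python sketch below is illustrative only, with hypothetical function names; it assumes the a priori means sum to $N$ so that $p_j = \bar{\phi}_j / N$ is a probability distribution. It evaluates the exact log probability with `math.lgamma`, the Stirling form (4), and the uniform-prior form (5).

```python
import math


def log_prior_exact(phi, phi_bar):
    """ln P(Phi) from function (1), with p_j = phi_bar_j / N."""
    n = sum(phi)
    total = math.lgamma(n + 1)                  # ln(N!)
    for f, fb in zip(phi, phi_bar):
        total += f * math.log(fb / n) - math.lgamma(f + 1)
    return total


def log_prior_stirling(phi, phi_bar):
    """Function (4): -sum_j phi_j ln(phi_j / phi_bar_j)."""
    return -sum(f * math.log(f / fb)
                for f, fb in zip(phi, phi_bar) if f > 0)


def entropy_form(phi):
    """Function (5): -sum_j phi_j ln phi_j + N ln(N / J)."""
    n, j = sum(phi), len(phi)
    return -sum(f * math.log(f) for f in phi if f > 0) + n * math.log(n / j)
```

Function (4) is zero when the image equals its a priori mean and negative otherwise, and with all means equal to $N/J$ it coincides with function (5). The exact and Stirling values differ by slowly varying logarithmic terms that the leading-order Stirling formula drops.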


Under the PMAPP, function (5) strongly resembles the
entropy form defined by Shannon [9] plus a constant. It has
the defined contents of image density $\{\phi_j\}$ and implies the
assumption of uniform a priori probability distribution. An
image information analysis can be carried out in a similar way
as Shannon's communication analysis [9]. An $m \times n$ image can
be represented by a point in $m \times n$ multidimensional space.
The mapping from a point of image in the $m \times n$ image space
to multidimensional data space can be ideally assumed as one to
one mapping. The distortion and noise contamination in the
measurement process blur the point in the data space to be a
small region. The inverse mapping, therefore, produces more
than one point (or a region) in the image space. Different objective
criteria impose different constraints on the inverse mapping
and on the selection of the corresponding point in the
image region of $m \times n$ dimensions. Function (1) reflects then
the probability distribution of the images over the image region.
Under the principle of maximum a priori probability, function
(4) can be a measure of the closeness of $\{\phi_j\}$ to $\{\bar{\phi}_j\}$ and
function (5) the measure of the probable distribution
configurations of $\{\phi_j\}$ over the image region in the $m \times n$
dimensional space. The most likely point in the image region of
$m \times n$ dimensions is the most likely distribution of the image
density $\{\phi_j\}$ over the $m \times n$ voxels. In image processing, an
image having density distribution $\{a\phi_j\}$ may not be distinct
from the image having density $\{b\phi_j\}$ (where $a$ and $b \neq 0$
are constants). In this case, the images are said to be degenerate.
The degeneracy is reflected, in the $m \times n$ multidimensional
space, as a line passing through the origin of coordinates.

As commonly assumed in most imaging applications, the
mapping is linear and can be expressed as [16]:

$$Y_i = \sum_j R_{ij} \phi_j + e_i, \quad i = 1, 2, \ldots, I$$ (6)

where $\{Y_i\}$ are the data elements and can be represented as a
point in the I-dimensional data space; $e_i$ is the noise
component and $R_{ij}$ the probability of receiving an image density
unit from voxel j for measurement point i (or projection ray
$Y_i$). In image restoration applications, $R_{ij}$ is the point spread
function (PSF) of the imaging system [16].

Since the noise components $\{e_i\}$ are unpredictable, the
statistical property of data fluctuation would be considered in
terms of probability distribution. If each data element $Y_i$
obeys Poisson statistics around the mean $\sum_j R_{ij}\phi_j$, and all the
data elements $\{Y_i\}$ are uncorrelated with each other, the data
probability distribution is [4,17-18]:

$$P(Y|\Phi) = \prod_{i=1}^{I} \left(\sum_j R_{ij}\phi_j\right)^{Y_i} \exp\left(-\sum_j R_{ij}\phi_j\right) / Y_i!$$ (7)

The $P(Y|\Phi)$ of function (7) is the data probability
function. It reflects the Poisson nature and uncorrelated fluctuations
of measurements. It is noted that the means $\{\sum_j R_{ij}\phi_j\}$
are, of course, correlated with each other.

Other probability distributions of data vector Y have been
considered in the previous work [18] and are, as references,
given in Appendix A.

The likelihood function of the data distribution is
expressed as [17]:

$$L(Y|\Phi) = \ln P(Y|\Phi) = \sum_i \left[ -\sum_j R_{ij}\phi_j + Y_i \ln\left(\sum_j R_{ij}\phi_j\right) - \ln(Y_i!) \right].$$ (8)

Function (8) is a measure of the likelihood with which the
source distribution $\{\phi_j\}$ would have given rise to the observed
data obeying the Poisson statistics.
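A short numerical sketch of the likelihood function (8) follows. It is an illustration only: the helper names, the 3 x 3 matrix `R` and the source values are hypothetical, and `math.lgamma(y + 1)` is used for ln(Y_i!) so that non-integer (noise-free) data are also accepted.

```python
import math

def forward_project(R, phi):
    """Means of the data elements: sum_j R_ij * phi_j for each i."""
    return [sum(r_ij * p for r_ij, p in zip(row, phi)) for row in R]

def poisson_log_likelihood(Y, R, phi):
    """Function (8): sum_i [ -m_i + Y_i ln(m_i) - ln(Y_i!) ], m_i = sum_j R_ij phi_j."""
    means = forward_project(R, phi)
    return sum(-m + y * math.log(m) - math.lgamma(y + 1.0)
               for y, m in zip(Y, means))

# Hypothetical 3-voxel system with a broad point spread function.
R = [[0.6, 0.3, 0.1],
     [0.2, 0.6, 0.2],
     [0.1, 0.3, 0.6]]
source = [2.0, 5.0, 3.0]
Y = forward_project(R, source)            # noise-free data

l_true = poisson_log_likelihood(Y, R, source)
l_other = poisson_log_likelihood(Y, R, [3.0, 3.0, 4.0])
print(l_true > l_other)   # the source that generated the data scores higher
```

Each term -m_i + Y_i ln(m_i) is largest when m_i equals Y_i, which is why the generating source attains the larger value for noise-free data.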

A Bayesian analysis providing the maximum a posteriori
solution is studied in the following section. It considers both
the likelihood character of data fluctuations (8) and the a priori
source probabilistic information (4). A discussion on considering
the a priori source information (4) and the linear data
constraints of Eqs.(6) is given in Appendix C, where the Lagrange
parameter technique [6] is employed.

BAYESIAN  ANALYSIS  AND  ALGORITHM 
Bayesian analysis provides a maximum a posteriori solution
$\Phi^*$ which considers both the data statistics and the a priori
source distribution probabilistic information. From Bayes' Law:

$$P(\Phi|Y) = P(Y|\Phi) P(\Phi) / P(Y),$$ (9)

the Bayesian function is given by:

$$g(\Phi) = \ln P(\Phi|Y) = \ln P(Y|\Phi) + \ln P(\Phi) - \ln P(Y).$$ (10)


Considering the Poisson nature of data fluctuations (8) and
the a priori probabilistic information of source distribution (4),
the Bayesian function is:

$$g(\Phi) = L(Y|\Phi) + H(\Phi) - \ln P(Y)$$

$$= \sum_i \left[ -\sum_j R_{ij}\phi_j + Y_i \ln\left(\sum_j R_{ij}\phi_j\right) \right] - \sum_j \left[\phi_j \ln(\phi_j/\bar{\phi}_j)\right] + C(Y)$$ (11)

where $C(Y) = -\ln P(Y) - \sum_i \ln(Y_i!)$ is independent of $\Phi$.
Since $C(Y)$ does not affect the determination of the Bayesian
solution $\Phi^*$ which maximizes function $g(\Phi)$, it will be omitted
later.


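A system of equations determining the Bayesian solution
$\Phi^*$ is derived by maximizing the Bayesian function $g(\Phi)$,
i.e., by setting $\partial g(\Phi)/\partial \phi_j = 0$:

$$\sum_i R_{ij} \left( Y_i / \sum_k R_{ik}\phi_k \right) - \sum_i R_{ij} = \ln(\phi_j) - \ln(\bar{\phi}_j) + 1.$$ (12)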


Since $g(\Phi)$ is strictly concave, i.e., for any non-vanishing
vector $Z$, there is:

$$Z^T \left[\nabla^2 g(\Phi)\right] Z = -\left[ \sum_i Y_i \left( \sum_j R_{ij} z_j / \sum_j R_{ij}\phi_j \right)^2 + \sum_j z_j^2/\phi_j \right] < 0,$$

the solution $\Phi^*$ is uniquely determined by Eqs.(12).

A Bayesian algorithm carrying out the calculation of the
solution $\Phi^*$ iteratively is derived by employing, among many
other iterative schemes [12-13], the EM technique [4,8,14,19]:


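$$\phi_j^{(n+1)} = \phi_j^{(n)} \, \frac{\sum_i R_{ij} \left( Y_i / \sum_k R_{ik}\phi_k^{(n)} \right)}{\sum_i R_{ij} + a^{(n)} Z_j^{(n)}}$$ (13)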


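and

$$Z_j^{(n)} = -\left[ \frac{\partial H(\Phi)}{\partial \phi_j} \right]_{\phi_j = \phi_j^{(n)} + \epsilon \delta_j^{(n)}}$$ (14)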


where $\epsilon = 1$ and $\delta_j^{(n)} = \phi_j^{(n)} - \phi_j^{(n-1)}$ are assumed for easy
computation, and $a^{(n)}$ is an adjustable sigmoidal parameter
chosen to gradually impose the effect of the a priori
information $H(\Phi)$:


$$a^{(n)} = \frac{A \, n^T}{B + n^T}$$ (15)

with A, B, and T constant.

Note that while the approximation of
$\phi_j \approx \phi_j^{(n)} + \epsilon \delta_j^{(n)}$ in Eq.(14) is assumed for easy
computation, other approximations can be used. The gradual increase
of $a^{(n)}$ as a function of iterative index n is quite important for
optimal results but the values of A, B, and T can vary
significantly from those used in this paper.

(a). For the uniform a priori probabilistic information (5):

$$Z_j^{(n)} = \ln(\phi_j^{(n)} + \epsilon \delta_j^{(n)}) + 1;$$ (16)

(b). For the non-uniform a priori information (4):

$$Z_j^{(n)} = \ln(\phi_j^{(n)} + \epsilon \delta_j^{(n)}) - \ln(\bar{\phi}_j) + 1.$$ (17)

The Bayesian algorithm of Eq.(13) considering the uniform
and non-uniform a priori information (16) and (17)
respectively will be applied to computer generated ideal data
and experimental phantom imaging data containing Poisson
noise in the following section. A discussion on maximizing
$g(\Phi)$ via the steepest descent technique [12] will be, as a simple
example, given in Appendix B. The discussions on emphasizing
the a priori $P(\Phi)$ are given in Appendices C and D respectively
by use of the data constraints (6) and (8).
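The iteration described in this section can be sketched in a few lines of code. This is only an illustrative sketch under stated assumptions, not the authors' implementation: it assumes the update divides the EM back-projection by the sum over i of R_ij plus a weighted prior term Z_j, assumes a sigmoidal weight of the form A n^T / (B + n^T), and adds small positivity guards that are not in the text. All function and variable names are hypothetical.

```python
import math

def sigmoid_weight(n, A=1.0, B=100.0, T=1.0):
    """Assumed form of the gradually increasing weight a(n)."""
    return A * n ** T / (B + n ** T)

def bip_iterate(Y, R, phi0, phi_bar=None, n_iter=50,
                A=1.0, B=100.0, T=1.0, eps=1.0):
    """EM-type Bayesian iteration; phi_bar=None selects the uniform prior term."""
    n_rays, n_vox = len(R), len(phi0)
    col_sum = [sum(R[i][j] for i in range(n_rays)) for j in range(n_vox)]
    phi_prev, phi = list(phi0), list(phi0)
    for n in range(n_iter):
        means = [sum(R[i][k] * phi[k] for k in range(n_vox))
                 for i in range(n_rays)]
        a_n = sigmoid_weight(n, A, B, T)
        new_phi = []
        for j in range(n_vox):
            # Back-projected ratio of measured to predicted data.
            back = sum(R[i][j] * Y[i] / means[i] for i in range(n_rays))
            # Extrapolated density phi + eps*delta, kept positive (guard added here).
            arg = max(phi[j] + eps * (phi[j] - phi_prev[j]), 1e-12)
            z = math.log(arg) + 1.0
            if phi_bar is not None:
                z -= math.log(phi_bar[j])
            denom = max(col_sum[j] + a_n * z, 1e-12)   # guard added here
            new_phi.append(phi[j] * back / denom)
        phi_prev, phi = phi, new_phi
    return phi

# Hypothetical 3-voxel example with noise-free data.
R = [[0.6, 0.3, 0.1], [0.2, 0.6, 0.2], [0.1, 0.3, 0.6]]
Y = [3.0, 4.0, 3.5]
em_only = bip_iterate(Y, R, [1.0, 1.0, 1.0], n_iter=5, A=0.0)
bayes = bip_iterate(Y, R, [3.5, 3.5, 3.5], phi_bar=[10.0, 25.0, 15.0])
print(em_only, bayes)
```

Setting A = 0 switches the prior term off, in which case the sketch reduces to the plain EM update and the weighted total of the estimate matches the total of the data after every iteration.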

RESULTS 

The BIP algorithm of Eq.(13) considering (16) and (17)
respectively is tested in two different imaging situations: (a)
computer generated noise-free data, where one dimensional
image restoration and two dimensional image restoration and
reconstruction are considered; and (b) computer generated and
experimental phantom imaging data containing Poisson noise,
where similar tests as in (a) are carried out. To facilitate
the calculation of the algorithm and the test function
mentioned below, only one dimensional results of convergence
performance of the algorithm are reported. Multidimensional
calculation is straightforward. Preliminary results using the
iterative algorithms derived in the Appendices are also reported.

(i). one dimensional image restoration results

In the case of noise-free data, the actual source distribution
$\{S_j\}$ consists of two point sources of 57 strength units
each, separated by 8 voxel units, superimposed upon a uniform
background of 3 strength units, as shown in Fig.1 by the solid
line. The ideal data (noise-free) $\{Y_i\}$ are calculated from
$Y_i = \sum_j R_{ij} S_j$, as shown in Fig.1 by the dotted line, where
the functional form of $\{R_{ij}\}$ is assumed as:

with $\Gamma$ = FWHM/2 = 4.5 voxel units, as shown in Fig.1 by
the broken line. It reflects an imaging system of quite poor
spatial resolution.


Fig.1 Source distribution (solid line), noise-free data (dotted
line) and point spread function (broken line).

Fig.2 compares the results of the a priori uniform BIP
algorithm (u) of Eqs.(13) and (16) (dotted line) and the a priori
non-uniform BIP algorithm (n) of Eqs.(13) and (17) (solid line)
after 50 iterations for the ideal data. The initial estimate
$\{\phi_j^{(0)}\}$ and the mean values $\{\bar{\phi}_j\}$ are chosen as:

=  Y,jZB,,  .  and  5,  =  riS;"’  (19) 

where $\eta \neq 0$ is a constant. The values of A = 1, B = 100,
T = 1 and $\eta$ = 5 are chosen.

Fig.2 Comparison of the a priori uniform (u, dotted line)
and non-uniform (n, solid line) BIP algorithms after 50
iterations for the noise-free data.

In order to quantitatively evaluate the performance of the
a priori uniform and non-uniform BIP algorithms, a test
function of root-mean-square criterion is used,

$$\psi_0 = \left[ \sum_j (\phi_j^{(n)} - S_j)^2 / \sum_j (S_j - \bar{S})^2 \right]^{1/2}$$ (20)

where $\bar{S}$ is the mean of $\{S_j\}$.

Fig.3 shows the results of $\psi_0$ as a function of iterative
index n. Since the neighboring voxels around the two point
sources play a dominant role in the results of the test function,
a smoothing weight filtering is applied before using the test
function. In other words, the test function is modified as
[13]:

$$\psi_1 = \left[ \sum_j (\phi'_j - S'_j)^2 / \sum_j (S'_j - \bar{S}')^2 \right]^{1/2}$$ (21)

where the weighting process is expressed as:

$$S'_j = \sum_i w_{ij} S_i / \sum_i w_{ij}, \quad i = j-2, j-1, j, j+1, j+2$$ (22)

and

$$w_{j-2,j} = 0.2, \quad w_{j-1,j} = 0.5, \quad w_{j,j} = 1.0, \quad w_{j+1,j} = 0.5, \quad w_{j+2,j} = 0.2$$ (23)

and similarly for $\{\phi'_j\}$.
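The two test functions can be sketched as follows. This is an illustrative sketch, assuming a five-point weighted average with weights 0.2, 0.5, 1.0, 0.5, 0.2 that is simply truncated at the ends of the array (the edge handling is an assumption made here); the function names and the 40-voxel test source are hypothetical.

```python
import math

# Smoothing weights for offsets i - j = -2 .. +2.
WEIGHTS = {-2: 0.2, -1: 0.5, 0: 1.0, 1: 0.5, 2: 0.2}

def smooth(values):
    """Weighted five-point average; the window is truncated at the edges."""
    n = len(values)
    out = []
    for j in range(n):
        num = den = 0.0
        for offset, w in WEIGHTS.items():
            i = j + offset
            if 0 <= i < n:
                num += w * values[i]
                den += w
        out.append(num / den)
    return out

def rms_test(estimate, source):
    """Normalized root-mean-square distance between estimate and source."""
    mean_s = sum(source) / len(source)
    num = sum((e - s) ** 2 for e, s in zip(estimate, source))
    den = sum((s - mean_s) ** 2 for s in source)
    return math.sqrt(num / den)

def smoothed_rms_test(estimate, source):
    """Same criterion applied after smoothing both distributions."""
    return rms_test(smooth(estimate), smooth(source))

# Hypothetical source: two point sources of 57 units, 8 voxels apart, on background 3.
source = [3.0] * 40
source[16] = source[24] = 57.0
flat = [sum(source) / len(source)] * len(source)   # structureless estimate

print(rms_test(source, source), rms_test(flat, source))
print(smoothed_rms_test(flat, source))
```

A perfect estimate gives 0 and a flat estimate at the source mean gives 1 for the unsmoothed criterion, so values below 1 indicate that structure has been recovered.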

The results of $\psi_1$ are shown in Fig.4. The modified test
function reflects more accurately the performance of
convergence of the iterative algorithms since it considers both the
amplitude and spatial components. As shown later, such
improvement with the modified test function is more
significant when the data is noisy.


Fig.3 Results using test function (20) for the BIP algorithms
in the case of noise-free data.

In the case of experimental imaging data, a simple one
dimensional equivalent phantom is prepared by threading two
parallel catheters (separated by about 7 voxel units) containing
a Co solution through a stainless steel screen. Two dimensional
data is obtained by imaging the phantom using a Picker
Dyna Camera Model No.4 without collimator and is arranged as
a 32 x 32 matrix, with the two lines of tubing oriented in the
column direction as shown in Fig.13. Row 16 of the data
matrix is indicated in Fig.5 by the stars and is used as the one
dimensional imaging data $\{Y_i\}$. Neglecting the effect of the
finite length of the tubing, $\{Y_i\}$ can be viewed as imaging data
from a double element source distribution $\{S_j\}$ (the projection
of the two line sources along the parallel direction superimposed
on a uniform background) in a one dimensional geometry,
where the means $\{\bar{\phi}_j\}$ are $\{\eta S_j\}$.

Fig.4 Results using test function (21) in the noise-free case.

$\{R_{ij}\}$ is obtained by imaging a point source at the
same depth as the phantom (as shown in Fig.14) and is formed
as a matrix using the technique [16]. One row of $\{R_{ij}\}$ is
shown in Fig.5 by the solid line in which the center value is
normalized to 40.


Fig.6  compares  the  results  of  the  BIP  algorithm  with  the  a 
priori  uniform  (dotted  line)  and  non-uniform  (solid  line)  infor¬ 
mation  after  50  iterations  for  the  experimental  imaging  data 
containing  Poisson  noise. 


The convergence performances of the BIP algorithms
using the test function (20) are shown in Fig.7. The results
using the modified test function (21) are shown in Fig.8.

Fig.6 Comparison of the a priori uniform (u, dotted line)
and non-uniform (n, solid line) BIP algorithms after 50
iterations for the experimental phantom imaging data containing
Poisson noise.


(ii). two dimensional image restoration results

In the case of noise-free data, the actual source distribution
$\{S_j\}$ is shown in Fig.9. It consists of two point sources of
109 strength units each, separated by 8 voxel units,
superimposed upon a uniform background of 1 strength unit. A two
dimensional PSF of Eq.(18) is assumed. Fig.10 shows the
noise-free data distribution calculated from the convolution of
the source distribution and the PSF. Fig.11 shows the result
using the a priori uniform BIP algorithm after 100 iterations
for the noise-free data.


Fig.7  Results  using  test  function  (20)  for  the  BIP  algorithms 
in  the  case  of  experimental  phantom  imaging  noisy  data. 


Fig.8 Results using test function (21) in the case of noisy
data.

The result of the a priori non-uniform BIP algorithm for the
noise-free data after 100 iterations is shown in Fig.12.

For the experimental phantom imaging tests, Figs.13 and
14 show respectively, as mentioned before, the experimental
imaging noisy data $\{Y_i\}$ from a phantom containing two
parallel lines of tubing and the point spread function of the camera
system from which $\{R_{ij}\}$ is formed [16].


Fig.9  Two  dimensional  source  distribution  consisting  of  two 
point  sources,  superimposed  upon  a  uniform  background. 


Fig.10 Two dimensional noise-free data distribution,
calculated from the convolution of the source distribution of Fig.9
and a two dimensional PSF of Eq.(18).

Fig.11 Result of the a priori uniform BIP algorithm after
100 iterations for the noise-free data.

Fig.15 shows the result of the a priori uniform BIP
algorithm after 25 iterations for the phantom imaging data. The
result of the a priori non-uniform BIP algorithm after 25
iterations for the noisy data is shown in Fig.16.

Fig.12 Result of the a priori non-uniform BIP algorithm
after 100 iterations for the noise-free data.


Fig.13 Two dimensional experimental phantom imaging data.


Fig.14 Two dimensional experimental point spread function.

Fig.15 Result of the a priori uniform BIP algorithm after 25
iterations for the experimental phantom imaging data
containing Poisson noise.


Fig.16 Result of the a priori non-uniform BIP algorithm
after 25 iterations for the experimental phantom imaging
noisy data.

Fig.17  An  elliptical  phantom  containing  four  hot  spots  and  a 
cold  spot,  superimposed  upon  a  uniform  background. 

(iii). two dimensional image reconstruction results

Fig.17 shows the actual source distribution $\{S_j\}$ consisting
of four hot spots of 4 strength units each and a cold spot of
2 strength units, superimposed upon a uniform background of 3
strength units. Outside the elliptical region, the density is zero.
The rectangular region is divided into 64 x 64 voxels (or pixels
in two dimensions).

The projection rays are calculated from $\{\sum_j R_{ij} S_j\}$
(noise-free projections) for parallel beam geometry using 64
equal projection angles in the interval [0, 180] degrees, where
$R_{ij}$ is the intersection length of projection ray i and voxel j.
Each projection contains 64 equally spaced projection rays.
Each of the noise-free projection rays, say ray i ($\sum_j R_{ij} S_j$), is
input to a Poisson random number generator [20]. The
generated Poisson randomized projection ray $Y_i$ then has the mean
$\sum_j R_{ij} S_j$. Since those Poisson randomized projection rays
$\{Y_i\}$ with zero mean are set to zero, the Poisson randomized
projection rays with non-vanishing means are in the range
from 1 to 120 counts. The summation of the noise-free
projection rays is 356341.94 and the total counts of the Poisson
randomized projections is 356946.
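The Poisson randomization step can be illustrated with a small sketch. It is not the generator of reference [20]; it substitutes Knuth's multiplication method, which is adequate for means of the size quoted above, and the function names, seed and sample values are hypothetical.

```python
import math
import random

def poisson_sample(mean, rng):
    """Draw one Poisson count by Knuth's multiplication method.

    Rays with zero (or negative) mean are set to zero, as in the text.
    """
    if mean <= 0.0:
        return 0
    threshold = math.exp(-mean)
    count = 0
    product = rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count

def randomize_projections(noise_free, seed=0):
    """Replace every noise-free projection ray by a Poisson count with that mean."""
    rng = random.Random(seed)
    return [poisson_sample(m, rng) for m in noise_free]

# Hypothetical noise-free rays, including rays outside the object (mean zero).
noise_free = [0.0, 12.5, 87.3, 120.0, 0.0, 4.0]
noisy = randomize_projections(noise_free, seed=1)
print(noisy, sum(noise_free), sum(noisy))
```

Because each count fluctuates around its mean, the total of the randomized rays is close to, but generally not equal to, the total of the noise-free rays, which is the behaviour reported for the two totals above.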

Figs.18 and 19 show respectively the results of the a
priori uniform and non-uniform BIP algorithms after 10
iterations for the noise-free projections. The results obtained by
applying the BIP algorithms to the Poisson randomized
projections after 10 iterations are shown in Figs.20 and 21
respectively.

Preliminary results of the iterative algorithms derived in
Appendices B, C and D are shown in the following:

Fig.22 compares the results of the descent algorithm
(Eqs.(B.2), (B.3) and (B.4)) with the a priori uniform (u, dotted
line) and non-uniform (n, solid line) information after 25
iterations for the noise-free data shown in Fig.1.


Fig.23 compares the results of the descent algorithm used
for Fig.22 after 50 iterations for the noise-free data.


Fig.18 Result of the a priori uniform BIP algorithm after 10
iterations for noise-free projections.

Fig.19 Result of the a priori non-uniform BIP algorithm
after 10 iterations for noise-free projections.


Fig.20 Result of the a priori uniform BIP algorithm after 10
iterations for Poisson randomized projections.

Fig.21 Result of the a priori non-uniform BIP algorithm
after 10 iterations for Poisson randomized projections.

Fig.22 Comparison of the a priori uniform (u, dotted line)
and non-uniform (n, solid line) descent algorithms after 25
iterations for the noise-free data shown in Fig.1.

Fig.23 Comparison of the a priori uniform (u, dotted line)
and non-uniform (n, solid line) descent algorithms after 50
iterations for the noise-free data.

In computer implementation of the Lagrange algorithm of
Eqs.(C.7)-(C.9), uneven convergence performance is observed.
The dotted lines in Figs.24 and 25 show the results of the a
priori uniform Lagrange algorithm after 12 and 13 iterations
respectively for the noise-free data shown in Fig.1. After the
13th iteration, no improvement is obtained. The solid lines in
Figs.24 and 25 show the results of the a priori non-uniform
Lagrange algorithm of Eqs.(C.4)-(C.6) after 15 and 25 iterations
respectively for the noise-free data. Smooth convergence
performance with the a priori non-uniform Lagrange algorithm is
observed.

Fig.26 shows the results of the a priori non-uniform
Picard algorithm of Eq.(D.2) after 10 (dotted line), 25 (broken
line) and 50 (solid line) iterations for the noise-free data shown
in Fig.1. Since the a priori uniform information (5) is
emphasized in the a priori uniform Picard algorithm of
Eq.(D.3), relatively flat solutions are obtained, as expected, in
which the two point sources in Fig.26 are no longer resolved.


Fig.24 Comparison of the a priori uniform (u) Lagrange
algorithm after 12 iterations (dotted line) and the non-uniform
(n) Lagrange algorithm after 18 iterations (solid line) for the
noise-free data.


Fig.25 Comparison of the a priori uniform (u) Lagrange
algorithm after 13 iterations (dotted line) and the non-uniform
(n) Lagrange algorithm after 19 iterations (solid line) for the
noise-free data.


Fig.26  Results  using  the  a  priori  non-uniform  Picard  algo¬ 
rithm  after  10  (dotted  line),  20  (broken  line)  and  50  (solid 
line)  iterations  for  the  noise-free  data. 

DISCUSSION 

This paper presents a statistical image model intended to
reflect the intrinsic probabilistic information of image density
distribution. The image model is formed based on the very
general assumptions of the discretization of image density units
and the random distribution process of the image units over the
voxels. Under the assumptions, the intrinsic probabilistic
information of image density distribution is expressed
mathematically as function (1). The probabilistic information function
(1) can be treated either as an additional a priori source
information to supplement the data likelihood solution or as a
maximum criterion considering the constraints of data
measurements. Incorporating additional a priori source information into
the Bayesian image processing (BIP) formalism for a solution of
maximum a posteriori probability has been discussed previously
[18,13,8], as well as in the Bayesian algorithm section and
Appendix B in this paper. Maximizing a priori source probabilistic
information in treating data measurements as constraints
implies the principle of maximum a priori probability as
mentioned in the introductory section of this paper. Preliminary
study on the maximum a priori source probability subject to
data constraints has been reported in references [21,13] as well
as in Appendices C and D in this paper.

Under the PMAPP, the probabilistic information function
(1) may be defined as a measure of the image information
content if the assumptions are applicable. This can be easily seen
in the following simple examples:

(a). maximizing function (1) under the assumption of
uniform a priori source probability distribution (i.e., function
(5)) without data constraints (or all measurements
are identical) results in a uniform image;

(b). maximizing function (1) with the non-uniform a
priori source probability distribution without data
constraints produces a non-uniform image having density
distribution proportional to the a priori mean values $\{\bar{\phi}_j\}$.

These examples are consistent with the assumptions of
the statistical image model function (1) and reflect two extreme
cases of the minimal and maximal a priori information content.

The examples can be more clearly understood in the
multidimensional image space. An m x n dimensional image is
represented by a point in an m x n dimensional space. Example
(a) specifies a point on the diagonal line (if images $\{a\phi_j\}$ are
indistinguishable from image $\{\phi_j\}$ with any constant a, or
degeneracy exists) in the multidimensional space. Example (b)
produces a point on the line defined by the vector $\bar{\Phi}$. How far
an image point can be brought away from the site on the
diagonal line toward the point $\{\bar{\phi}_j\}$ on the line $\bar{\Phi}$
depends on the data measurements. The distortion and inevitable
noise in data measurements prevent the image point from
reaching the point $\{\bar{\phi}_j\}$. The distortion can be removed in
image processing by a suitable algorithm. However, the noise
effects cannot be removed in the image processing. The noise
defines a disk range in the plane perpendicular to the line $\bar{\Phi}$ and
passing through the point $\{\bar{\phi}_j\}$. Normalization of processed
images in an iterative image processing may be helpful for
convergence to the disk range. A well-conditioned algorithm
considering the noise property may produce an image point within
the disk range. For the purpose of considering both the data
statistics and a priori source information, the BIP formalism [18]
has been developed to consider the pattern source information
[22-23], the source distribution boundary condition [24] and the
image enhancement from estimated a priori information [25].

It is noted that the information functions (4) and (5)
reflect two extreme cases. Other information content between
them has been investigated [8]. The application of the PMAPP
to the a priori source information [8,18,22-25] is straightforward.

APPENDIX A. The Likelihood Functions of Gaussian
Data

If each data element $Y_i$ obeys Gaussian statistics around
the mean value $\sum_j R_{ij}\phi_j$, and all the data elements $\{Y_i\}$ are
uncorrelated, then the data probability distribution is:

$$P(Y|\Phi) = \prod_{i=1}^{I} (2\pi\sigma_i^2)^{-1/2} \exp\left[ -\left(Y_i - \sum_j R_{ij}\phi_j\right)^2 / (2\sigma_i^2) \right]$$ (A.1)

and the likelihood function is

$$L(Y|\Phi) = \sum_i \left[ -\left(Y_i - \sum_j R_{ij}\phi_j\right)^2 / (2\sigma_i^2) - \frac{1}{2} \ln(2\pi\sigma_i^2) \right].$$ (A.2)

If the data elements $\{Y_i\}$ are correlated with correlation
parameters $\{\lambda_{ik}\}$, function (A.1) becomes:

/•(YID)  =  nca  (X,,  ,(T,  .tr,  )  X  (A. 3) 
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'  ‘  J  J 

and  (A.2)  becomes: 

^  (Y 1  *)  =  B-  ^  (K.  -ZRu  t'j )  (K,  -Z/?o  ) 

II  I  I  j  J 

-  InCu  (.)  ] .  (A.4) 

Most regularization techniques [26-29] can be derived
from the Bayesian analysis considering the data likelihood
function (A.2) or (A.4) and the generic a priori source
information [13,18].

APPENDIX B. Consideration of Steepest Descent Technique

Maximizing the a posteriori probability $P(\Phi|Y)$ by use of
the steepest descent technique [12] is expressed as:

$$g(\Phi) = L(Y|\Phi) + H(\Phi) - \ln P(Y)$$ (B.1)

$$= \sum_i \left[ -\sum_j R_{ij}\phi_j + Y_i \ln\left(\sum_j R_{ij}\phi_j\right) \right] - \sum_j \left[\phi_j \ln(\phi_j/\bar{\phi}_j)\right] + C(Y) = \text{maximum}.$$

Since $g(\Phi)$ is strictly concave, the iterative steepest
descent scheme is then given by:

$$\phi_j^{(n+1)} = \phi_j^{(n)} + \alpha d_j^{(n)}$$ (B.2)

$$d_j^{(n)} = \nabla_j g(\Phi^{(n)})$$ (B.3)

$$= \sum_i R_{ij} \left( Y_i / \sum_k R_{ik}\phi_k^{(n)} - 1 \right) - \ln(\phi_j^{(n)}/\bar{\phi}_j) - 1$$

and 


Eq.(C.2) gives the solution $\Phi^*$.

If the uniform a priori probabilistic information (5) is
considered as a special case, the solution $\{\phi_j\}$ is given by
Eqs.(C.2), (C.5) and (C.6) with replacement of $\bar{\phi}_j$ by 1.

APPENDIX D. Consideration of Recursive Picard Technique

The maximum a posteriori probability solution $\Phi^*$ is given
by the system of equations (12), or

$$\phi_j = \bar{\phi}_j \exp\left[ \sum_i R_{ij} \left( Y_i / \sum_k R_{ik}\phi_k - 1 \right) - 1 \right].$$ (D.1)

Introducing an adjustable parameter $\mu$ for the data
constraints and using the recursive Picard technique [15], an
iterative Picard scheme can be expressed as:

$$\phi_j^{(n+1)} = \bar{\phi}_j \exp\left[ \mu \sum_i R_{ij} \left( Y_i / \sum_k R_{ik}\phi_k^{(n)} - 1 \right) - 1 \right].$$ (D.2)

For the special case of the uniform a priori probability
distribution information (5), Eq.(D.2) reduces to:

$$\phi_j^{(n+1)} = \exp\left[ \mu \sum_i R_{ij} \left( Y_i / \sum_k R_{ik}\phi_k^{(n)} - 1 \right) - 1 \right].$$ (D.3)
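One step of the recursive Picard scheme can be sketched as follows. This is a minimal illustration, not the authors' code: the function name, the 3 x 3 matrix `R` and the sample values are hypothetical, and passing no a priori means is taken to select the uniform case in which the means are replaced by 1.

```python
import math

def picard_step(phi, Y, R, mu, phi_bar=None):
    """One Picard update: phi_bar_j * exp[mu * sum_i R_ij (Y_i/m_i - 1) - 1].

    m_i = sum_k R_ik phi_k is the predicted datum; phi_bar=None uses 1 for
    every a priori mean (uniform case).
    """
    n_rays, n_vox = len(R), len(phi)
    means = [sum(R[i][k] * phi[k] for k in range(n_vox)) for i in range(n_rays)]
    new_phi = []
    for j in range(n_vox):
        data_term = sum(R[i][j] * (Y[i] / means[i] - 1.0) for i in range(n_rays))
        prior_mean = 1.0 if phi_bar is None else phi_bar[j]
        new_phi.append(prior_mean * math.exp(mu * data_term - 1.0))
    return new_phi

# Hypothetical 3-voxel example.
R = [[0.6, 0.3, 0.1], [0.2, 0.6, 0.2], [0.1, 0.3, 0.6]]
Y = [3.0, 4.0, 3.5]
phi_bar = [10.0, 25.0, 15.0]
phi = [3.5, 3.5, 3.5]
for _ in range(10):
    phi = picard_step(phi, Y, R, mu=0.5, phi_bar=phi_bar)
print(phi)
```

With the data weight set to zero the update returns the a priori means divided by e, which shows how the adjustable parameter trades the data constraints against the a priori information.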
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P.O.  Box  1892 
Houston  TX  77251 
tom®stat5.  rice.edu 
(713)  975-1173 

Keegcl  John  C. 

University  of  DC 
1740  Hobart  St.  NW 
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Route  202-206N,  Bldg  M 
Somerville  NJ  08876 

(201)  231-3486 

Minor  James  M. 

Du  Pont  Company 
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(202)  885-3149 

Mode  Dr.  Charles  J. 

Department  of  Math,  and  Computer  Sci. 
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New  York  NY  10024 
(212)  799-8212 

Pierce  Alan 

Amoco  Production  Company 
4502  East  41st  Street 
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